# 04. Population Data Ingestion

## Why Population Data Matters for Fire Prediction

Population data is crucial for wildfire risk assessment because:

### **Human Fire Causes**
- **Roughly 85-90% of U.S. wildfires** are human-caused according to commonly cited estimates (campfires, equipment use, arson, etc.)
- Higher population density = higher probability of human ignition
- Urban areas have more ignition sources (power lines, vehicles, etc.)

### **Wildland-Urban Interface (WUI)**
- Where housing meets or intermingles with wildland vegetation, fire risk to people and property is highest
- Evacuation planning requires knowing how many people live in fire-prone areas
- Resource allocation prioritizes protecting populated areas

### **Fire Suppression & Response**
- More people = more fire stations and resources nearby
- Population density affects emergency response times
- Evacuation routes depend on population distribution

### **Economic Impact**
- Property damage scales with population density
- Insurance costs and fire suppression budgets correlate with population
- Business disruption affects more people in dense areas

## Data Source
- **Source**: US Census Bureau API
- **Coverage**: California counties (2010-2024, matching the years present in the loaded dataset)
- **Update Frequency**: Annual (Decennial Census + American Community Survey)
- **API Documentation**: https://www.census.gov/data/developers/data-sets.html
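
As a rough illustration of how county-level rows might be requested from the Census Bureau API, the sketch below builds a query URL and parses a response. The choice of the ACS 5-year endpoint and of table B01001 (total, male, female) is an assumption, since this notebook does not show what `scripts/download_population_data.py` actually requests. The helper names `build_acs_url` and `census_rows_to_frame` are hypothetical. No network call is made; the payload only mimics the API's list-of-lists JSON shape, with a header row first.

```python
import pandas as pd

CENSUS_BASE = "https://api.census.gov/data"

def build_acs_url(year, variables=("NAME", "B01001_001E", "B01001_002E", "B01001_026E"),
                  state_fips="06"):
    """Build a county-level ACS 5-year query URL for one state.

    Assumption: the ACS 5-year endpoint and table B01001 are used here for
    illustration; the real download script may query something else.
    """
    get_clause = ",".join(variables)
    return f"{CENSUS_BASE}/{year}/acs/acs5?get={get_clause}&for=county:*&in=state:{state_fips}"

def census_rows_to_frame(rows):
    """Convert the API's list-of-lists payload (header row first) to a DataFrame."""
    header, *records = rows
    return pd.DataFrame(records, columns=header)

# Illustrative payload in the API's shape; counts are copied from the
# sample rows shown later in this notebook, not fetched from the API.
sample_payload = [
    ["NAME", "B01001_001E", "B01001_002E", "B01001_026E", "state", "county"],
    ["Alameda County, California", "1663823", "826561", "837262", "06", "001"],
    ["Alpine County, California", "1515", "882", "633", "06", "003"],
]

url = build_acs_url(2020)
frame = census_rows_to_frame(sample_payload)
print(url)
print(frame)
```

The API returns every value as a string, so numeric columns would still need an explicit cast (for example with `pd.to_numeric`) before analysis.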

## Objectives
1. Load and validate California population data
2. Explore population trends and distributions
3. Calculate population density metrics
4. Identify high-risk Wildland-Urban Interface areas
5. Prepare population features for ML model
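
Objective 3 needs a land-area figure per county, which the population file loaded in this notebook does not contain. A minimal sketch of the calculation, assuming a separate lookup table with a hypothetical `land_area_sq_km` column keyed on `county_fips`, might look like this. The counts and areas below are toy placeholders, not real county figures.

```python
import pandas as pd

def add_population_density(population, land_area, area_col="land_area_sq_km"):
    """Attach land area to each county-year row and compute people per square km.

    `land_area` is a hypothetical lookup table with one row per county.
    """
    merged = population.merge(
        land_area, on="county_fips", how="left", validate="many_to_one"
    )
    merged["population_density"] = merged["total_population"] / merged[area_col]
    return merged

# Toy placeholder values, for illustration only
toy_population = pd.DataFrame({
    "county_fips": [1, 1, 3],
    "year": [2010, 2020, 2010],
    "total_population": [1000, 1500, 200],
})
toy_land_area = pd.DataFrame({
    "county_fips": [1, 3],
    "land_area_sq_km": [100.0, 50.0],
})

density = add_population_density(toy_population, toy_land_area)
print(density[["county_fips", "year", "population_density"]])
```

Passing `validate="many_to_one"` makes the merge fail loudly if the lookup table accidentally lists a county twice, which would otherwise duplicate county-year rows.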


## Import Libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("📚 Libraries imported successfully!")


📚 Libraries imported successfully!


## Load Population Data


In [2]:
# Load the combined population dataset
data_path = Path("../data/raw/population")
population_file = data_path / "california_population_combined.csv"

if not population_file.exists():
    # Stop here so later cells do not fail with a NameError on df_population
    raise FileNotFoundError(
        f"Population data not found at {population_file}. "
        "Run the download script first: python scripts/download_population_data.py"
    )

df_population = pd.read_csv(population_file)
print(f"✅ Loaded population data: {population_file}")
print(f"📊 Shape: {df_population.shape}")


✅ Loaded population data: ../data/raw/population/california_population_combined.csv
📊 Shape: (406, 7)


## Summary for ML Model

### **Dataset Overview**
- **Source**: US Census Bureau API
- **Coverage**: California counties (2010-2024)
- **Records**: 406 county-year combinations (58 counties × 7 years)
- **Counties**: 58 (all California counties)
- **Years**: 2010, 2012, 2015, 2018, 2020, 2022, 2024

### **Core Features Available**
1. **`total_population`** - Total population count per county
2. **`male_population`** - Male population count
3. **`female_population`** - Female population count
4. **`county_name`** - California county name
5. **`county_fips`** - Federal Information Processing Standard code
6. **`state_fips`** - California state FIPS code (06, stored as the integer `6`)
7. **`year`** - Data collection year
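
Because `state_fips` and `county_fips` are stored as integers, their leading zeros are lost. When joining to other county-level datasets, it is common to rebuild the standard five-digit county code (two-digit state plus three-digit county). The sketch below shows one way to do that; the helper name `add_geoid` and the `geoid` column are illustrative and not part of the dataset.

```python
import pandas as pd

def add_geoid(df):
    """Return a copy with a zero-padded five-digit county code in `geoid`."""
    out = df.copy()
    state = out["state_fips"].astype(int).astype(str).str.zfill(2)
    county = out["county_fips"].astype(int).astype(str).str.zfill(3)
    out["geoid"] = state + county
    return out

# Rows taken from the notebook's sample output
sample = pd.DataFrame({
    "state_fips": [6, 6, 6],
    "county_fips": [1, 3, 5],
    "county_name": ["Alameda", "Alpine", "Amador"],
})

with_geoid = add_geoid(sample)
print(with_geoid["geoid"].tolist())  # ['06001', '06003', '06005']
```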

### **Derived Features for ML**
1. **`gender_ratio`** - Male/Female population ratio
2. **`population_category`** - Rural/Suburban/Urban classification
3. **`population_growth_rate`** - Population change (%) relative to the previous available year for the same county (the years in the dataset are not consecutive)
4. **`wui_risk`** - Wildland-Urban Interface risk level (High/Medium/Low)
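
The first three derived features can be computed from the core columns alone. The following is a minimal sketch, not the notebook's own implementation: the `rural_max` and `urban_min` thresholds are hypothetical placeholders, and the growth rate is measured against the previous available year per county. `wui_risk` is left out because it would need vegetation or land-cover data that this dataset does not include.

```python
import pandas as pd

def add_derived_features(df, rural_max=50_000, urban_min=500_000):
    """Add gender_ratio, population_category and population_growth_rate.

    The category thresholds are illustrative assumptions, not values
    defined anywhere in this notebook.
    """
    out = df.sort_values(["county_fips", "year"]).reset_index(drop=True)

    # Male/female ratio per county-year
    out["gender_ratio"] = out["male_population"] / out["female_population"]

    # Left-closed bins: [0, rural_max), [rural_max, urban_min), [urban_min, inf)
    out["population_category"] = pd.cut(
        out["total_population"],
        bins=[0, rural_max, urban_min, float("inf")],
        labels=["Rural", "Suburban", "Urban"],
        right=False,
    )

    # Percent change relative to the previous available year of the same county
    previous = out.groupby("county_fips")["total_population"].shift(1)
    out["population_growth_rate"] = (out["total_population"] - previous) / previous * 100
    return out

# Toy values for illustration only (deliberately unsorted)
toy = pd.DataFrame({
    "county_fips": [1, 1, 3, 3],
    "year": [2012, 2010, 2010, 2012],
    "total_population": [1100, 1000, 600000, 630000],
    "male_population": [550, 500, 300000, 315000],
    "female_population": [550, 500, 300000, 315000],
})

features = add_derived_features(toy)
print(features[["county_fips", "year", "population_category", "population_growth_rate"]])
```

The first observed year of each county has no earlier value to compare against, so its growth rate is `NaN` and needs to be handled before model training.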

### **Fire Prediction Relevance**
- **Human Ignition**: Most wildfires are human-caused (commonly cited U.S. estimates are roughly 85-90%)
- **Population Density**: Higher density = more ignition sources
- **WUI Assessment**: Identifies high-risk areas where people meet forests
- **Evacuation Planning**: Population counts for emergency response
- **Resource Allocation**: Prioritize fire protection in populated areas

### **Data Quality**
- ✅ No missing values
- ✅ All 58 California counties included
- ✅ Reasonable population ranges (no negative or unrealistic values)
- ✅ Consistent data across all years
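
Checks like the ones listed above can be expressed as a small validation function. This is a sketch under assumptions: the helper `validate_population` is hypothetical, and the rule that male plus female equals the total is inferred from the sample rows in this notebook, where it holds, rather than from a documented guarantee.

```python
import pandas as pd

def validate_population(df, expected_counties=58):
    """Return a list of human-readable data-quality issues (empty if none)."""
    issues = []
    count_cols = ["total_population", "male_population", "female_population"]

    if df.isna().any().any():
        issues.append("missing values present")

    if (df[count_cols] < 0).any().any():
        issues.append("negative population counts present")

    # Assumed rule: the two sex counts should add up to the total
    mismatch = df["male_population"] + df["female_population"] != df["total_population"]
    if mismatch.any():
        issues.append(f"{int(mismatch.sum())} rows where male + female != total")

    # Every year should cover the expected number of distinct counties
    per_year = df.groupby("year")["county_fips"].nunique()
    incomplete = per_year[per_year != expected_counties]
    if not incomplete.empty:
        issues.append(f"years without {expected_counties} counties: {incomplete.index.tolist()}")

    if df.duplicated(["county_fips", "year"]).any():
        issues.append("duplicate county-year rows")

    return issues

# The five sample rows shown in this notebook's output
sample = pd.DataFrame({
    "total_population": [1663823, 1515, 40577, 213605, 45674],
    "male_population": [826561, 882, 22007, 106376, 22749],
    "female_population": [837262, 633, 18570, 107229, 22925],
    "state_fips": [6, 6, 6, 6, 6],
    "county_fips": [1, 3, 5, 7, 9],
    "year": [2010, 2010, 2010, 2010, 2010],
    "county_name": ["Alameda", "Alpine", "Amador", "Butte", "Calaveras"],
})

# Only five counties are in the sample, so override the default of 58
issues = validate_population(sample, expected_counties=5)
print(issues)
```

On the full dataset the function would be called with the default `expected_counties=58`.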


In [3]:
# Display basic information about the population dataset
print("🔍 Population Data Overview")
print("=" * 40)
print(f"Shape: {df_population.shape}")
print(f"Years covered: {sorted(df_population['year'].unique())}")
print(f"Counties: {df_population['county_name'].nunique()}")
print(f"Columns: {list(df_population.columns)}")

print("\n📋 Data Types:")
print(df_population.dtypes)

print("\n📊 First 5 rows:")
df_population.head()


🔍 Population Data Overview
Shape: (406, 7)
Years covered: [2010, 2012, 2015, 2018, 2020, 2022, 2024]
Counties: 58
Columns: ['total_population', 'male_population', 'female_population', 'state_fips', 'county_fips', 'year', 'county_name']

📋 Data Types:
total_population      int64
male_population       int64
female_population     int64
state_fips            int64
county_fips           int64
year                  int64
county_name          object
dtype: object

📊 First 5 rows:


| | total_population | male_population | female_population | state_fips | county_fips | year | county_name |
|---|---|---|---|---|---|---|---|
| 0 | 1663823 | 826561 | 837262 | 6 | 1 | 2010 | Alameda |
| 1 | 1515 | 882 | 633 | 6 | 3 | 2010 | Alpine |
| 2 | 40577 | 22007 | 18570 | 6 | 5 | 2010 | Amador |
| 3 | 213605 | 106376 | 107229 | 6 | 7 | 2010 | Butte |
| 4 | 45674 | 22749 | 22925 | 6 | 9 | 2010 | Calaveras |
