# Data Merging & Preparation for Wildfire Risk Prediction

**Objective**: Merge all processed datasets into a unified dataset for ML model training

**Datasets to Merge**:
1. **Fire Data**: `fire_data_complete_unit_month.csv` - Fire occurrences by unit-month
2. **Weather Data**: `unified_county_weather_2000_2025.csv` - County-level weather (43 counties)
3. **Drought Data**: `drought_data_monthly_2000_2025.csv` - Statewide monthly drought
4. **Population Data**: `population_data_long_format_2000_2025.csv` - County population (58 counties)
5. **Topography Data**: `terrain_data_timeseries_2000_2025.csv` - County terrain features

**Merging Strategy**:
- **Base Table**: Fire data; every other dataset is left-joined onto it
- **Primary Key**: County + Year + Month
- **Fire Data**: Unit-month records, mapped to counties via Unit_ID (the `County` column)
- **Weather Data**: 43 counties with complete weather data
- **Drought Data**: Statewide data applied to all counties, joined on Year + Month
- **Population Data**: 58 counties with yearly data (no Month column), joined on County + Year
- **Topography Data**: 58 counties with terrain features

**Expected Output**: Unified dataset ready for ML model training

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

print('✅ Libraries imported successfully!')

✅ Libraries imported successfully!


## 1. Load All Processed Datasets

In [2]:
# Load all processed datasets
print('=' * 80)
print('LOADING ALL PROCESSED DATASETS')
print('=' * 80)

data_dir = Path('../data/processed')

# 1. Fire Data
print('\n🔥 Loading Fire Data...')
fire_file = data_dir / 'fire_data_complete_unit_month.csv'
fire_df = pd.read_csv(fire_file)
print(f'   Records: {len(fire_df):,}')
print(f'   Columns: {list(fire_df.columns)}')
print(f'   Date range: {fire_df["Year"].min()}-{fire_df["Year"].max()}')

# 2. Weather Data
print('\n🌤️ Loading Weather Data...')
weather_file = data_dir / 'unified_county_weather_2000_2025.csv'
weather_df = pd.read_csv(weather_file)
print(f'   Records: {len(weather_df):,}')
print(f'   Columns: {list(weather_df.columns)}')
print(f'   Counties: {weather_df["County"].nunique()}')

# 3. Drought Data
print('\n🌵 Loading Drought Data...')
drought_file = data_dir / 'drought_data_monthly_2000_2025.csv'
drought_df = pd.read_csv(drought_file)
print(f'   Records: {len(drought_df):,}')
print(f'   Columns: {list(drought_df.columns)}')
print(f'   Date range: {drought_df["Year"].min()}-{drought_df["Year"].max()}')

# 4. Population Data
print('\n👥 Loading Population Data...')
pop_file = data_dir / 'population_data_long_format_2000_2025.csv'
pop_df = pd.read_csv(pop_file)
print(f'   Records: {len(pop_df):,}')
print(f'   Columns: {list(pop_df.columns)}')
print(f'   Counties: {pop_df["COUNTY"].nunique()}')

# 5. Topography Data
print('\n⛰️ Loading Topography Data...')
terrain_file = data_dir / 'terrain_data_timeseries_2000_2025.csv'
terrain_df = pd.read_csv(terrain_file)
print(f'   Records: {len(terrain_df):,}')
print(f'   Columns: {list(terrain_df.columns)}')
print(f'   Counties: {terrain_df["County"].nunique()}')

print('\n✅ All datasets loaded successfully!')
print('=' * 80)

LOADING ALL PROCESSED DATASETS

🔥 Loading Fire Data...
   Records: 34,008
   Columns: ['Unit_ID', 'Year', 'Month', 'Fire_Count', 'Total_Acres', 'Avg_Acres', 'Max_Acres', 'Fire_Occurred', 'County']
   Date range: 2000-2025

🌤️ Loading Weather Data...
   Records: 423,808
   Columns: ['County', 'Year', 'Month', 'Avg_Temp', 'Max_Temp', 'Min_Temp', 'Precipitation']
   Counties: 43

🌵 Loading Drought Data...
   Records: 310
   Columns: ['Year', 'Month', 'None', 'D0', 'D1', 'D2', 'D3', 'D4', 'Drought_Intensity_Score', 'Severe_Drought_Area', 'Has_D0', 'Has_D1', 'Has_D2', 'Has_D3', 'Has_D4']
   Date range: 2000-2025

👥 Loading Population Data...
   Records: 1,508
   Columns: ['COUNTY', 'Year', 'Population']
   Counties: 58

⛰️ Loading Topography Data...
   Records: 18,096
   Columns: ['Year', 'Month', 'County', 'Mean_Elevation', 'Max_Elevation', 'Min_Elevation', 'Mean_Slope', 'Max_Slope', 'Mean_Aspect', 'Terrain_Roughness']
   Counties: 58

✅ All datasets loaded successfully!
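
The record counts above hint at the grain of each table. With 43 counties, 26 years (2000-2025) and 12 months, the weather file could hold at most 13,416 county-months, yet it has 423,808 rows, so it most likely carries several rows per `County`/`Year`/`Month`. The sketch below shows one way to check key uniqueness before merging. `key_report` is a hypothetical helper, and the toy frame only stands in for `weather_df`.

```python
import pandas as pd

def key_report(df: pd.DataFrame, keys: list) -> dict:
    """Summarise how many rows share each key combination (hypothetical helper)."""
    sizes = df.groupby(keys, dropna=False).size()
    return {
        "rows": len(df),
        "unique_keys": len(sizes),
        "max_rows_per_key": int(sizes.max()),
        "is_unique": bool((sizes == 1).all()),
    }

# Toy stand-in for weather_df: two rows share the Alameda / 2000 / 1 key
toy_weather = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Butte"],
    "Year": [2000, 2000, 2000],
    "Month": [1, 1, 1],
    "Avg_Temp": [50.0, 52.0, 45.0],
})

report = key_report(toy_weather, ["County", "Year", "Month"])
print(report)
```

Running the same check on each loaded frame (with `['County', 'Year']` for population and `['Year', 'Month']` for drought) would show which tables are safe to left-join as they are.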


## 2. Data Merging Strategy

In [3]:
# Prepare datasets for merging
print('=' * 80)
print('PREPARING DATASETS FOR MERGING')
print('=' * 80)

# Fire data: Already has County, Year, Month
fire_merge = fire_df.copy()
print(f'   Fire data: {fire_merge.shape} - Ready for merge')

# Weather data: Already has County, Year, Month
weather_merge = weather_df.copy()
print(f'   Weather data: {weather_merge.shape} - Ready for merge')

# Population data: Rename COUNTY to County (NO Month column)
pop_merge = pop_df.copy()
pop_merge = pop_merge.rename(columns={'COUNTY': 'County'})
print(f'   Population data: {pop_merge.shape} - Ready for merge (Yearly data)')

# Terrain data: Already has County, Year, Month
terrain_merge = terrain_df.copy()
print(f'   Terrain data: {terrain_merge.shape} - Ready for merge')

# Drought data: Statewide data
drought_merge = drought_df.copy()
print(f'   Drought data: {drought_merge.shape} - Statewide (will merge separately)')

print('\n✅ All datasets prepared for merging!')
print('=' * 80)

PREPARING DATASETS FOR MERGING
   Fire data: (34008, 9) - Ready for merge
   Weather data: (423808, 7) - Ready for merge
   Population data: (1508, 3) - Ready for merge (Yearly data)
   Terrain data: (18096, 10) - Ready for merge
   Drought data: (310, 15) - Statewide (will merge separately)

✅ All datasets prepared for merging!
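
The preparation step renames `COUNTY` to `County` but does not check that county values are spelled the same way in every file. The missing-value report in Section 5 shows `Population` and every terrain column as 100% missing after the merge, which is what a left join produces when no keys match. The sketch below normalises county names and lists keys without a partner. `normalize_county` and `unmatched_keys` are hypothetical helpers, and the casing and trailing "County" suffix are assumed variations, since the actual values in the files are not shown here.

```python
import pandas as pd

def normalize_county(names: pd.Series) -> pd.Series:
    """Trim, lower-case, and drop a trailing ' county' suffix (assumed variations)."""
    return (
        names.astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"\s+county$", "", regex=True)
    )

def unmatched_keys(left: pd.DataFrame, right: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return distinct key combinations in `left` that have no partner in `right`."""
    probe = left[keys].drop_duplicates().merge(
        right[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    return probe.loc[probe["_merge"] == "left_only", keys].reset_index(drop=True)

# Toy frames standing in for fire_merge and pop_merge (values are made up)
fire_toy = pd.DataFrame({"County": ["Alameda", "Butte"], "Year": [2000, 2000]})
pop_toy = pd.DataFrame({
    "County": ["ALAMEDA COUNTY", "BUTTE COUNTY"],
    "Year": [2000, 2000],
    "Population": [100, 200],
})

before = unmatched_keys(fire_toy, pop_toy, ["County", "Year"])

fire_toy["County"] = normalize_county(fire_toy["County"])
pop_toy["County"] = normalize_county(pop_toy["County"])
after = unmatched_keys(fire_toy, pop_toy, ["County", "Year"])

print(len(before), len(after))
```

If the real mismatch lies elsewhere (for example a differing `Year` dtype), the `unmatched_keys` output should still point to it, because it lists the exact key values that failed to join.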


## 3. Perform Merging

In [4]:
# Start merging process
print('=' * 80)
print('MERGING ALL DATASETS')
print('=' * 80)

# Start with Fire Data as base
print('\n🔥 Starting with Fire Data as base...')
merged = fire_merge.copy()
print(f'   Base dataset: {merged.shape}')

# Merge with Weather data
# NOTE: weather_merge repeats (County, Year, Month) keys, so this left join multiplies fire rows
print('\n🌤️ Merging Weather Data...')
merged = merged.merge(
    weather_merge,
    on=['County', 'Year', 'Month'],
    how='left'
)
print(f'   After weather merge: {merged.shape}')

# Merge with Population data (NO Month column)
print('\n👥 Merging Population Data...')
merged = merged.merge(
    pop_merge,
    on=['County', 'Year'],
    how='left'
)
print(f'   After population merge: {merged.shape}')

# Merge with Terrain data
print('\n⛰️ Merging Terrain Data...')
merged = merged.merge(
    terrain_merge,
    on=['County', 'Year', 'Month'],
    how='left'
)
print(f'   After terrain merge: {merged.shape}')

# Add Drought data (statewide): every county gets the same values for a given Year/Month.
# A single left join on Year + Month replaces the slow row-by-row lookup.
print('\n🌵 Adding Drought Data (statewide)...')
merged = merged.merge(
    drought_merge,
    on=['Year', 'Month'],
    how='left',
    validate='many_to_one'  # fails early if drought data repeats a Year/Month
)
print(f'   After drought merge: {merged.shape}')

print('\n✅ All datasets merged successfully!')
print('=' * 80)

MERGING ALL DATASETS

🔥 Starting with Fire Data as base...
   Base dataset: (34008, 9)

🌤️ Merging Weather Data...
   After weather merge: (1017452, 13)

👥 Merging Population Data...
   After population merge: (1017452, 14)

⛰️ Merging Terrain Data...
   After terrain merge: (1017452, 21)

🌵 Adding Drought Data (statewide)...
   After drought merge: (1017452, 34)

✅ All datasets merged successfully!
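
A left join cannot add rows unless the right-hand table repeats a key, so the jump from 34,008 to 1,017,452 rows at the weather merge indicates several weather rows per `County`/`Year`/`Month`. One option, sketched below on toy data, is to collapse the weather table to one row per key first and let `validate='many_to_one'` guard the join. The aggregation rules (mean of `Avg_Temp` and `Precipitation`, max of `Max_Temp`, min of `Min_Temp`) are assumptions. The right choice depends on what the repeated rows represent, which the notebook does not state.

```python
import pandas as pd

# Toy stand-ins for fire_merge and weather_merge (values are made up)
fire_toy = pd.DataFrame({
    "Unit_ID": ["U1", "U2"],
    "County": ["Alameda", "Butte"],
    "Year": [2000, 2000],
    "Month": [1, 1],
    "Fire_Occurred": [1, 0],
})
weather_toy = pd.DataFrame({
    "County": ["Alameda", "Alameda", "Butte"],
    "Year": [2000, 2000, 2000],
    "Month": [1, 1, 1],
    "Avg_Temp": [50.0, 54.0, 45.0],
    "Max_Temp": [60.0, 66.0, 55.0],
    "Min_Temp": [40.0, 42.0, 35.0],
    "Precipitation": [1.0, 3.0, 2.0],
})
keys = ["County", "Year", "Month"]

# Naive left join: the repeated Alameda key duplicates the U1 fire row
naive = fire_toy.merge(weather_toy, on=keys, how="left")

# Collapse weather to one row per county-month (aggregation rules are assumptions)
weather_monthly = weather_toy.groupby(keys, as_index=False).agg(
    Avg_Temp=("Avg_Temp", "mean"),
    Max_Temp=("Max_Temp", "max"),
    Min_Temp=("Min_Temp", "min"),
    Precipitation=("Precipitation", "mean"),
)

# validate raises pandas.errors.MergeError if the right side still repeats a key
safe = fire_toy.merge(weather_monthly, on=keys, how="left", validate="many_to_one")

print(len(fire_toy), len(naive), len(safe))
```

Comparing `len(merged)` with `len(fire_merge)` after each join is a cheap way to catch this kind of row growth as soon as it happens.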


## 4. Final Data Preparation & Export

In [5]:
# Save the merged dataset
print('=' * 80)
print('SAVING MERGED DATASET')
print('=' * 80)

# Save to CSV
output_file = data_dir / 'merged_wildfire_dataset_2000_2025.csv'
merged.to_csv(output_file, index=False)

print(f'💾 Saved merged dataset to: {output_file}')
print(f'   File size: {output_file.stat().st_size / (1024*1024):.2f} MB')
print(f'   Records: {len(merged):,}')
print(f'   Columns: {merged.shape[1]}')

# Final summary
print(f'\n📊 FINAL DATASET SUMMARY:')
print(f'   Total records: {len(merged):,}')
print(f'   Unique counties: {merged["County"].nunique()}')
print(f'   Date range: {merged["Year"].min()}-{merged["Year"].max()}')
print(f'   Fire occurrence rate: {(merged["Fire_Occurred"] == 1).mean()*100:.2f}%')

print(f'\n🎯 Merged dataset ready for ML model training!')
print('=' * 80)

SAVING MERGED DATASET
💾 Saved merged dataset to: ../data/processed/merged_wildfire_dataset_2000_2025.csv
   File size: 129.19 MB
   Records: 1,017,452
   Columns: 34

📊 FINAL DATASET SUMMARY:
   Total records: 1,017,452
   Unique counties: 43
   Date range: 2000-2025
   Fire occurrence rate: 10.77%

🎯 Merged dataset ready for ML model training!
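
The merged table has 1,017,452 rows against 34,008 records in the fire data, so the fire occurrence rate above is weighted by how often each fire record was repeated during the merge. The sketch below compares the row-level rate with the rate over distinct records, on toy data. Treating `Unit_ID`/`Year`/`Month` as the identifier of one fire record is an assumption based on the file name `fire_data_complete_unit_month.csv`.

```python
import pandas as pd

# Toy stand-in for merged: the U1 record appears three times after a duplicating join
merged_toy = pd.DataFrame({
    "Unit_ID": ["U1", "U1", "U1", "U2"],
    "Year": [2000, 2000, 2000, 2000],
    "Month": [1, 1, 1, 1],
    "Fire_Occurred": [1, 1, 1, 0],
})

# Rate over all rows: repeated records count multiple times
row_rate = (merged_toy["Fire_Occurred"] == 1).mean() * 100

# Rate over distinct unit-months: each record counts once
unit_months = merged_toy.drop_duplicates(subset=["Unit_ID", "Year", "Month"])
unit_month_rate = (unit_months["Fire_Occurred"] == 1).mean() * 100

print(f"{row_rate:.2f}% vs {unit_month_rate:.2f}%")
```

A large gap between the two figures on the real data would suggest that the class balance seen by a model differs from the balance in the original fire records.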


## 5. Initial Exploratory Data Analysis (EDA)

In [6]:
# Basic data exploration
print('=' * 80)
print('INITIAL DATA EXPLORATION')
print('=' * 80)

# Check for missing values
print('\n🔍 Missing Values Analysis:')
missing_data = merged.isnull().sum()
missing_percent = (missing_data / len(merged)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

print(missing_df[missing_df['Missing Count'] > 0])

# Data types
print('\n📋 Data Types:')
print(merged.dtypes)

# Basic statistics
print('\n📊 Basic Statistics:')
print(merged.describe())

print('\n✅ Initial exploration complete!')
print('=' * 80)

INITIAL DATA EXPLORATION

🔍 Missing Values Analysis:
                         Missing Count  Missing Percentage
Mean_Slope                     1017452          100.000000
Terrain_Roughness              1017452          100.000000
Mean_Aspect                    1017452          100.000000
Max_Slope                      1017452          100.000000
Min_Elevation                  1017452          100.000000
Max_Elevation                  1017452          100.000000
Mean_Elevation                 1017452          100.000000
Population                     1017452          100.000000
Max_Temp                          2284            0.224482
Precipitation                     2284            0.224482
Min_Temp                          2284            0.224482
Avg_Temp                          2284            0.224482
County                            1872            0.183989
D1                                 218            0.021426
Drought_Intensity_Score            218            0.021426
Sev