## CPSC 4820 Project - Fire Intensity Prediction using Machine Learning

#### Datasource
+ https://cwfis.cfs.nrcan.gc.ca/downloads/hotspots/archive/ 

### Project Outline
+ Dataset Information
+ Data Preprocessing
+ Exploratory Descriptive Analysis
+ Machine Learning Model Development
+ Prediction/Result
+ Evaluating the result/metrics
+ Conclusion
  

#### About Dataset
+ Datasource:
    - https://cwfis.cfs.nrcan.gc.ca/downloads/hotspots/archive/
  
+ Description:
    - The data were originally collected from several sources, including USFS, NASA, and NOAA, using satellite sensors such as VIIRS, MODIS, and AVHRR aboard satellites such as Terra, METOP-B, NOAA-19, Aqua, S-NPP, and NOAA-15.
    
+ Metadata:
    - The dataset is distributed in .DBF format; a Python script was used to convert each year's data to .csv format.
    - It has 33 fields or attributes. 
    
+ Attribute Information:
   - LAT - latitude in decimal degrees 
   - LON - longitude in decimal degrees 
   - REP_DATE - date and time of detection 
   - UID - unique id 
   - SOURCE - data source 
   - SENSOR - satellite sensor 
   - SATELLITE - satellite name 
   - AGENCY - province/territory in which hotspot is located 
   - TEMP - local noon temperature (°C) at hotspot location 
   - RH - local noon relative humidity (%) at hotspot location 
   - WS - local noon wind speed (km/h) at hotspot location 
   - WD - local noon wind direction (degrees) at hotspot location 
   - PCP - local noon 24-hour precipitation (mm) at hotspot location 
   - FFMC - fine fuel moisture code at hotspot location 
   - DMC - duff moisture code at hotspot location 
   - DC - drought code at hotspot location 
   - ISI - Initial spread index at hotspot location 
   - BUI - buildup index at hotspot location 
   - FWI - fire weather index at hotspot location 
   - FUEL - FBP fuel type at hotspot location 
   - ROS - rate of spread (m/min) at hotspot location (modelled) 
   - SFC - surface fuel consumption (kg/m2) at hotspot location (modelled) 
   - TFC - total fuel consumption (kg/m2) at hotspot location (modelled) 
   - HFI - head fire intensity (kW/m) at hotspot location (modelled) 
   - CFB - crown fraction burned (%) at hotspot location (modelled) 
   - PCURING - percent curing 
   - ECOZONE - ecozone in which hotspot is located 
   - ESTAREA - approximate burned area based on historical average area burned per hotspot by agency and fuel type 
   - CFACTOR - curing factor 
   - GREENUP - green-up state (phenological state of deciduous trees) 
   - ELEV - elevation above sea level (meters) 
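
The conversion script mentioned under Metadata is not shown in this notebook. A minimal sketch of how such a conversion could look is given below, assuming the third-party `dbfread` package is installed; the helper names `records_to_csv` and `convert_dbf` are hypothetical, and lower-casing the header is an assumption made because the cells below refer to columns such as `rep_date` and `hfi`.

```python
import csv


def records_to_csv(records, csv_path):
    """Write an iterable of dict-like records to csv_path and return the row count."""
    count = 0
    writer = None
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        for record in records:
            if writer is None:
                # Assumption: lower-case the header so that columns match the
                # names used later in the notebook ('rep_date', 'hfi', ...).
                writer = csv.writer(fh)
                writer.writerow([name.lower() for name in record.keys()])
            writer.writerow(list(record.values()))
            count += 1
    return count


def convert_dbf(dbf_path, csv_path):
    """Convert one .DBF file to CSV (needs the third-party dbfread package)."""
    # Imported lazily so records_to_csv can be used without dbfread installed.
    from dbfread import DBF
    return records_to_csv(DBF(dbf_path), csv_path)
```

Calling `convert_dbf` once per yearly archive file would give one CSV per year, which could then be concatenated into a combined file.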

### Load Packages

In [46]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")

### Preprocessing

In [112]:
hotspots = pd.read_csv("data/combined_hotspots.csv")

In [113]:
hotspots['rep_date'] = pd.to_datetime(hotspots['rep_date'], errors='coerce')
hotspots['year'] = hotspots['rep_date'].dt.year
hotspots['month'] = hotspots['rep_date'].dt.month
hotspots['day'] = hotspots['rep_date'].dt.day

In [114]:
print(hotspots['year'].value_counts())

2023    696595
2021    239067
2022     52420
2019     33087
2020     12721
Name: year, dtype: int64


#### Sampling

In [115]:
years_of_interest = [2019, 2020, 2021, 2022, 2023]
hotspots = hotspots[hotspots['year'].isin(years_of_interest)]

# Stratified sampling by year: keep a 5% sample (the "train" split) and discard the remaining 95%
sampled_df, _ = train_test_split(hotspots, test_size=0.95, stratify=hotspots['year'], random_state=1)

# Display the shape of the sampled DataFrame to verify
print(f"Sampled data shape: {sampled_df.shape}")

# Ensure each year is represented in the sampled data
print("Sampled data year distribution:")
print(sampled_df['year'].value_counts())

Sampled data shape: (51694, 40)
Sampled data year distribution:
2023    34830
2021    11953
2022     2621
2019     1654
2020      636
Name: year, dtype: int64
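
The counts above suggest the stratification worked: 2023 makes up about 67.4% of both the full data and the sample. A small sketch of how this could be checked programmatically follows; the helper names and the toy data are hypothetical, and the pandas-only `groupby(...).sample(...)` draw is used here as a stand-in for `train_test_split`, not as the method used above.

```python
import pandas as pd


def year_share(df, column="year"):
    """Return the fraction of rows per value of `column`, sorted by value."""
    return df[column].value_counts(normalize=True).sort_index()


def max_share_gap(population, sample, column="year"):
    """Largest absolute difference in per-year share between two frames."""
    gap = year_share(population, column).sub(
        year_share(sample, column), fill_value=0
    )
    return float(gap.abs().max())


# Toy data standing in for the hotspot table (hypothetical counts).
population = pd.DataFrame({"year": [2021] * 600 + [2022] * 100 + [2023] * 1300})

# Draw 5% from each year group so the year mix is preserved.
sample = population.groupby("year").sample(frac=0.05, random_state=1)

gap = max_share_gap(population, sample)
print(f"Largest difference in year share: {gap:.4f}")
```

A gap close to zero indicates that the sample keeps the year distribution of the full data.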


In [116]:
sampled_df.to_csv('data/sampled_hotspots.csv', index=False)

#### Handling Outliers and Missing Values

In [117]:
# Assessing features with missing values
null_values = sampled_df.isnull().sum()
print("Columns with null values:\n", null_values[null_values > 0])

Columns with null values:
 source       34830
satellite     2486
ros              8
sfc              8
tfc              8
bfc          37451
hfi              8
cfb              8
estarea      50987
pcuring      34745
greenup      34745
sfl            189
tfc0             8
ecozone         11
sfc0            14
cbh          35019
uid          51058
fid          16864
dtype: int64


In [118]:
# Calculate the percentage of missing values in each column
missing_percentage = sampled_df.isnull().mean() * 100

# Convert the Series to a DataFrame for better readability
missing_percentage_df = missing_percentage.reset_index()
missing_percentage_df.columns = ['Column', 'Missing Percentage']

# Filter the DataFrame for missing percentages greater than 0
missing_percentage_df = missing_percentage_df[missing_percentage_df['Missing Percentage'] > 0]
missing_percentage_df

| | Column | Missing Percentage |
|---|---|---|
| 3 | source | 67.377258 |
| 5 | satellite | 4.809069 |
| 19 | ros | 0.015476 |
| 20 | sfc | 0.015476 |
| 21 | tfc | 0.015476 |
| 22 | bfc | 72.447479 |
| 23 | hfi | 0.015476 |
| 24 | cfb | 0.015476 |
| 25 | estarea | 98.632336 |
| 26 | pcuring | 67.212829 |


In [119]:
# Fill the missing satellite information with 'unknown'
sampled_df['satellite'] = sampled_df['satellite'].fillna('unknown')

In [120]:
# Identify rows with missing values in the 'sfc' column
sfc_missing_indices = sampled_df[sampled_df['sfc'].isnull()].index

In [121]:
# Check whether the rows missing 'sfc' are also missing in the related columns
columns_to_check = ['tfc', 'hfi', 'cfb', 'tfc0', 'sfc0']
missing_in_all = sampled_df.loc[sfc_missing_indices, columns_to_check].isnull().all(axis=1)

# All True means every row missing 'sfc' is also missing in each of the other columns
# (the other columns may still have additional missing rows of their own)
if missing_in_all.all():
    print("Every row with a missing 'sfc' is also missing 'tfc', 'hfi', 'cfb', 'tfc0', and 'sfc0'.")
else:
    print("Some rows with a missing 'sfc' have values in 'tfc', 'hfi', 'cfb', 'tfc0', or 'sfc0'.")

# Display the indices where the missing values do not match
mismatched_indices = missing_in_all[~missing_in_all].index
print("Indices with mismatched missing values:\n", mismatched_indices)

Every row with a missing 'sfc' is also missing 'tfc', 'hfi', 'cfb', 'tfc0', and 'sfc0'.
Indices with mismatched missing values:
 Int64Index([], dtype='int64')
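
The check above only runs in one direction: it confirms that rows missing `sfc` are missing in the other columns, not the reverse. The null counts printed earlier show 14 missing values in `sfc0` against 8 in `sfc`, so the masks cannot be identical for every column. A sketch of a two-directional comparison is given below; the function name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_pattern_report(df, reference, others):
    """Compare the null mask of `reference` with each column in `others`."""
    ref_mask = df[reference].isnull()
    rows = []
    for col in others:
        mask = df[col].isnull()
        rows.append({
            "column": col,
            # Every row missing in `reference` is also missing in `col`.
            "ref_subset_of_col": bool((~ref_mask | mask).all()),
            # The two columns are missing in exactly the same rows.
            "identical_mask": bool((ref_mask == mask).all()),
            # Rows missing in `col` but present in `reference`.
            "extra_missing": int((mask & ~ref_mask).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data: 'sfc0' has one extra missing row compared with 'sfc'.
toy = pd.DataFrame({
    "sfc":  [1.0, np.nan, 3.0, 4.0],
    "tfc":  [1.0, np.nan, 3.0, 4.0],
    "sfc0": [1.0, np.nan, np.nan, 4.0],
})
report = missing_pattern_report(toy, "sfc", ["tfc", "sfc0"])
print(report)
```

On the real data this could be called as `missing_pattern_report(sampled_df, 'sfc', columns_to_check)`.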


In [122]:
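# Drop columns that are largely missing (including the uid and fid identifiers)
sampled_df = sampled_df.drop(columns=['cbh', 'bfc','pcuring', 'greenup','source','estarea','fid','uid'])
# Drop the few remaining rows with missing values in these columns
sampled_df = sampled_df.dropna(subset=['sfl','tfc0','ecozone','sfc0'])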
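
The eight dropped columns are the ones with a large share of missing values in the sample; the lowest is `fid` with 16864 of 51694 rows missing, roughly 32.6%. The column list above was written by hand. A sketch of selecting such columns with a threshold is shown below; the 30% cut-off and the function name are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df, threshold=0.30):
    """Return the columns whose share of missing values exceeds `threshold`."""
    share = df.isnull().mean()
    return share[share > threshold].index.tolist()


# Toy data: 'estarea' is 75% missing, 'satellite' is 25% missing.
toy = pd.DataFrame({
    "hfi": [1.0, 2.0, 3.0, 4.0],
    "estarea": [np.nan, np.nan, np.nan, 0.5],
    "satellite": ["Terra", None, "Aqua", "S-NPP"],
})
to_drop = columns_above_missing_threshold(toy)
print(to_drop)
```

With the counts printed earlier, a 30% threshold would select the same eight columns and keep `satellite` (about 4.8% missing).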

In [123]:
# Count duplicate rows, then drop them
num_duplicates = sampled_df.duplicated().sum()
sampled_df = sampled_df.drop_duplicates()

In [124]:
# Remove hotspots with a head fire intensity of zero
sampled_df = sampled_df[sampled_df['hfi'] != 0]

In [125]:
sampled_df = sampled_df.drop(columns=['tfc0', 'sfc0'])
sampled_df = sampled_df.drop(columns=['agency'])  # dropped because the data only cover BC

In [126]:
def categorize_intensity(hfi):
    if hfi <= 10:
        return 'Low'
    elif hfi <= 100:
        return 'Moderate'
    else:
        return 'High'

# Apply the function to create the new 'Intensity' feature
sampled_df['Intensity'] = sampled_df['hfi'].apply(categorize_intensity)

# Display the first few rows to verify
sampled_df[['hfi', 'Intensity']].head()

| | hfi | Intensity |
|---|---|---|
| 362494 | 6580.0 | High |
| 999473 | 7182.0 | High |
| 885880 | 1796.0 | High |
| 664445 | 34.0 | Moderate |
| 527583 | 1.0 | Low |
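
The same binning can be written without a row-wise `apply`. The sketch below uses `pd.cut` with the thresholds from `categorize_intensity` (10 and 100 kW/m); the function name is hypothetical. One difference worth noting: `pd.cut` leaves a missing `hfi` as missing, whereas the `if`/`elif` version would fall through to `'High'`, because comparisons with NaN are false.

```python
import numpy as np
import pandas as pd


def categorize_intensity_vectorized(hfi):
    """Bin head fire intensity into Low (<= 10), Moderate (<= 100), and High."""
    return pd.cut(
        hfi,
        bins=[-np.inf, 10, 100, np.inf],
        labels=["Low", "Moderate", "High"],
    )  # intervals are right-inclusive by default, matching the <= comparisons


hfi = pd.Series([1.0, 10.0, 34.0, 100.0, 1796.0])
labels = categorize_intensity_vectorized(hfi)
print(labels.tolist())
```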


In [127]:
# Remove extreme outliers: keep only rows with hfi <= 60000
sampled_df = sampled_df[sampled_df['hfi'] <= 60000]

# Verify the filtering step
print(f"Number of rows after filtering 'hfi' > 60000: {sampled_df.shape[0]}")

Number of rows after filtering 'hfi' > 60000: 48838


In [128]:
sampled_df.to_csv('data/cleaned_hotspots.csv', index=False)

### Descriptive Analysis

In [129]:
clean_df = pd.read_csv("data/cleaned_hotspots.csv")

In [130]:
clean_df.head()

| | lat | lon | rep_date | sensor | satellite | temp | rh | ws | wd | pcp | ... | hfi | cfb | elev | sfl | cfl | ecozone | year | month | day | Intensity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.923618 | -122.94812 | 2023-08-21 19:47:00 | VIIRS-I | NOAA-20 | 21.681 | 26 | 6.221 | 10 | 0.001 | ... | 6580.0 | 22.0 | 1168 | -1.0 | 1.493392 | 14.0 | 2023 | 8 | 21 | High |
| 1 | 59.328319 | -121.481293 | 2023-09-23 09:39:00 | VIIRS-I | NOAA-20 | 14.237001 | 43 | 6.112 | 63 | 0.0 | ... | 7182.0 | 61.0 | 611 | 29.668455 | 0.237964 | 4.0 | 2023 | 9 | 23 | High |
| 2 | 58.67429 | -121.57357 | 2023-08-28 21:50:00 | VIIRS-I | S-NPP | 32.043999 | 27 | 7.551 | 331 | 0.0 | ... | 1796.0 | 0.0 | 459 | 30.668169 | 0.089391 | 4.0 | 2023 | 8 | 28 | High |
| 3 | 59.92577 | -120.703651 | 2023-05-30 18:56:00 | MODIS | Terra | 15.977 | 23 | 15.333 | 249 | 0.237 | ... | 34.0 | 0.0 | 529 | 33.270531 | 0.193294 | 4.0 | 2023 | 5 | 30 | Moderate |
| 4 | 54.862831 | -125.730042 | 2023-07-16 10:24:00 | VIIRS-I | S-NPP | 16.922001 | 65 | 6.577 | 159 | 1.518 | ... | 1.0 | 0.0 | 914 | 6.501198 | 0.696778 | 14.0 | 2023 | 7 | 16 | Low |


In [131]:
clean_df.shape

(48838, 30)

In [132]:
clean_df.columns

Index(['lat', 'lon', 'rep_date', 'sensor', 'satellite', 'temp', 'rh', 'ws',
       'wd', 'pcp', 'ffmc', 'dmc', 'dc', 'isi', 'bui', 'fwi', 'fuel', 'ros',
       'sfc', 'tfc', 'hfi', 'cfb', 'elev', 'sfl', 'cfl', 'ecozone', 'year',
       'month', 'day', 'Intensity'],
      dtype='object')

In [133]:
clean_df.dtypes

lat          float64
lon          float64
rep_date      object
sensor        object
satellite     object
temp         float64
rh             int64
ws           float64
wd             int64
pcp          float64
ffmc         float64
dmc          float64
dc           float64
isi          float64
bui          float64
fwi          float64
fuel          object
ros          float64
sfc          float64
tfc          float64
hfi          float64
cfb          float64
elev           int64
sfl          float64
cfl          float64
ecozone      float64
year           int64
month          int64
day            int64
Intensity     object
dtype: object

In [134]:
columns_to_describe = [
    'temp', 'rh', 'ws', 'wd', 'pcp', 'ffmc', 'dmc', 'dc', 
    'isi', 'bui', 'fwi', 'ros', 'sfc', 'tfc', 'hfi', 
    'cfb', 'elev', 'sfl', 'cfl'
]

clean_df[columns_to_describe].describe()

| | temp | rh | ws | wd | pcp | ffmc | dmc | dc | isi | bui | fwi | ros | sfc | tfc | hfi | cfb | elev | sfl | cfl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 | 48838.0 |
| mean | 21.832041 | 35.400016 | 9.475066 | 200.133523 | 0.20759 | 89.992634 | 82.990481 | 528.14063 | 8.34015 | 115.140583 | 28.123408 | 5.601104 | 2.655998 | 2.840183 | 6028.430239 | 22.424833 | 927.511856 | 9.128397 | 0.794414 |
| std | 5.546177 | 11.458461 | 3.213379 | 91.0175 | 0.847562 | 5.758842 | 36.510222 | 143.200223 | 3.688466 | 41.170559 | 11.087424 | 5.738636 | 1.201498 | 1.409124 | 7684.808837 | 34.383702 | 391.301645 | 7.791424 | 0.61537 |
| min | -17.797001 | 11.0 | 3.102 | 0.0 | 0.0 | 36.573002 | 0.0 | 0.0 | 0.027 | 0.0 | 0.016 | 0.01 | 0.02 | 0.02 | 1.0 | 0.0 | -1.0 | -1.0 | 0.0 |
| 25% | 18.377001 | 27.0 | 7.081 | 135.0 | 0.0 | 89.541 | 60.267499 | 439.284012 | 5.881 | 91.114998 | 21.80525 | 1.4 | 2.03 | 2.04 | 842.0 | 0.0 | 609.0 | 5.138567 | 0.23569 |
| 50% | 22.454 | 34.0 | 8.765 | 203.0 | 0.0 | 91.091003 | 73.6045 | 546.779999 | 8.261 | 108.914501 | 27.693001 | 3.89 | 2.99 | 3.076882 | 3267.0 | 0.0 | 910.0 | 7.119925 | 0.681534 |
| 75% | 25.865999 | 41.0 | 11.16 | 275.0 | 0.004 | 93.024002 | 99.488998 | 622.76799 | 10.91075 | 130.950745 | 35.23675 | 8.04 | 3.44 | 3.67 | 8223.75 | 48.0 | 1187.0 | 11.943716 | 1.286852 |
| max | 40.594002 | 97.0 | 31.66 | 360.0 | 39.828999 | 97.850998 | 266.734985 | 1122.06006 | 26.660999 | 329.535004 | 68.861 | 95.870003 | 4.95 | 6.816985 | 59681.0 | 100.0 | 2540.0 | 37.321037 | 3.719171 |


In [135]:
clean_df['Intensity'].value_counts()

High        42848
Moderate     4714
Low          1276
Name: Intensity, dtype: int64
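
The classes are strongly imbalanced, with `High` covering roughly 88% of the rows. The sketch below turns the counts printed above into class shares and into "balanced" weights of the form `n_samples / (n_classes * count)`, the same formula scikit-learn documents for `class_weight="balanced"`. Whether the models in this project use such weights is not stated here, so this is an illustration only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"High": 42848, "Moderate": 4714, "Low": 1276})

# Share of each class in the cleaned data.
share = counts / counts.sum()

# Balanced weights: rare classes receive proportionally larger weights.
balanced_weight = counts.sum() / (len(counts) * counts)

summary = pd.DataFrame({"share": share.round(4), "weight": balanced_weight.round(3)})
print(summary)
```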

In [136]:
# Calculate the correlation matrix
corr_matrix = clean_df[columns_to_describe].corr()

# Find highly correlated pairs
high_corr_pairs = [(i, j) for i in corr_matrix.columns for j in corr_matrix.columns 
                   if i != j and abs(corr_matrix.loc[i, j]) > 0.9]

# Display the highly correlated pairs
print("Highly correlated pairs:")
for i, j in high_corr_pairs:
    print(f"{i} and {j} with correlation {corr_matrix.loc[i, j]:.2f}")

Highly correlated pairs:
dmc and bui with correlation 0.96
isi and fwi with correlation 0.94
bui and dmc with correlation 0.96
fwi and isi with correlation 0.94
sfc and tfc with correlation 0.97
tfc and sfc with correlation 0.97
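
Each pair is printed twice above because the loop visits both (i, j) and (j, i). A sketch that reports every pair once by keeping only the upper triangle of the correlation matrix follows; the function name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def unique_high_corr_pairs(df, threshold=0.9):
    """Return (col_a, col_b, r) for each column pair with |r| above threshold."""
    corr = df.corr()
    # Boolean mask that is True strictly above the diagonal, so each
    # unordered pair appears exactly once and self-correlations are skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # The comparison is False for NaN, so masked entries are removed here too.
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Toy data: 'bui' is nearly a multiple of 'dmc'; 'elev' is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "dmc": x,
    "bui": 1.2 * x + rng.normal(scale=0.05, size=200),
    "elev": rng.normal(size=200),
})
pairs = unique_high_corr_pairs(toy)
for a, b, r in pairs:
    print(f"{a} and {b} with correlation {r:.2f}")
```

On the real data this could be called as `unique_high_corr_pairs(clean_df[columns_to_describe])`.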


+ dmc and bui (0.96):
  - Description: Duff Moisture Code (DMC) and Buildup Index (BUI).
  - Importance: The two indices are highly correlated because BUI is calculated from DMC (together with DC). This strong positive correlation indicates that as DMC increases, BUI increases almost proportionally.

+ isi and fwi (0.94):
  - Description: Initial Spread Index (ISI) and Fire Weather Index (FWI).
  - Importance: ISI is a component of FWI, leading to a strong positive correlation. Higher ISI values contribute to higher FWI values, which assess overall fire danger.

+ ros and hfi (0.92):
  - Description: Rate of Spread (ROS) and Head Fire Intensity (HFI).
  - Importance: ROS directly influences HFI. A higher rate of spread typically leads to greater fire intensity, resulting in a strong positive correlation.

+ ros and cfb (0.82):
  - Description: Rate of Spread (ROS) and Crown Fraction Burned (CFB).
  - Importance: A higher rate of spread can lead to more extensive crown burning, indicating a strong positive correlation.

+ sfc and tfc (0.97):
  - Description: Surface Fuel Consumption (SFC) and Total Fuel Consumption (TFC).
  - Importance: SFC contributes significantly to TFC, leading to a very strong positive correlation.

+ sfc and tfc0 (0.98):
  - Description: Surface Fuel Consumption (SFC) and Total Fuel Consumption under zero conditions (TFC0).
  - Importance: Similar to TFC, TFC0 also includes SFC, resulting in a strong positive correlation.

+ sfc and sfc0 (1.00):
  - Description: Surface Fuel Consumption (SFC) and Surface Fuel Consumption under zero conditions (SFC0).
  - Importance: These are essentially the same measure under different conditions, leading to a perfect correlation.

+ tfc and tfc0 (0.99):
  - Description: Total Fuel Consumption (TFC) and Total Fuel Consumption under zero conditions (TFC0).
  - Importance: Both measure similar aspects of fuel consumption, leading to a very strong positive correlation.

+ hfi and cfb (0.83):
  - Description: Head Fire Intensity (HFI) and Crown Fraction Burned (CFB).
  - Importance: Higher fire intensity often leads to greater burning of the forest canopy, showing a strong positive correlation.

+ General Insights:
  - Redundancy: Many of these highly correlated pairs suggest redundancy in the data. For example, SFC, TFC, TFC0, and SFC0 are all closely related, so including all of them would add little information to a predictive model. TFC0 and SFC0 were already dropped during preprocessing.
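
Several of the correlations discussed above follow from how the indices are defined. The sketch below assumes the commonly published forms of two relationships: the Buildup Index as a function of DMC and DC in the Canadian FWI System, and head fire intensity as 300 × TFC × ROS (Byram's intensity with a fixed heat of combustion). The constants are quoted as assumptions and should be checked against the official FWI/FBP documentation before being relied on; the function names are hypothetical.

```python
def bui_from_dmc_dc(dmc, dc):
    """Buildup Index from DMC and DC (assumed standard FWI System form)."""
    if dmc == 0 and dc == 0:
        return 0.0
    ratio = 0.8 * dc / (dmc + 0.4 * dc)
    if dmc <= 0.4 * dc:
        # Equivalent to 0.8 * DMC * DC / (DMC + 0.4 * DC).
        return dmc * ratio
    # Branch for DMC large relative to DC, clipped at zero.
    return max(0.0, dmc - (1.0 - ratio) * (0.92 + (0.0114 * dmc) ** 1.7))


def hfi_from_tfc_ros(tfc, ros):
    """Head fire intensity (kW/m) from TFC (kg/m2) and ROS (m/min)."""
    return 300.0 * tfc * ros


# Rounded mean DMC and DC from the describe() output above.
print(round(bui_from_dmc_dc(83.0, 528.0), 1))
# A fire consuming 2 kg/m2 while spreading at 10 m/min.
print(hfi_from_tfc_ros(2.0, 10.0))
```

Under these assumptions, HFI is a product of ROS and TFC, and BUI scales with DMC whenever DC is comparatively large, which is consistent with the strong correlations reported for these pairs.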