# Gold Dataset Generator
<font size=3><strong>Author:</strong> Ashkan Soltanieh<br>
<strong>Date:</strong> Jan. 15, 2022</font>

## Table of Contents

<div class="alert alert-success mt-20">
    <ul>
        <li><a href="#Approach">Approach</a></li>
        <li><a href="#Quartile-Analysis">Quartile Analysis</a></li>
        <li><a href="#Metadata">Metadata</a></li>
    </ul>
</div>

## Approach
So far we have merged and cleaned part of the data. As a quick overview, we have cleaned the two wildfire datasets and merged and cleaned the weather data. The weather data are aggregated to daily means and standard deviations so that they are consistent and aligned with the wildfire datasets. The weather data have also been reduced to a focused dataset that covers only wildfire dates and locations, with the remaining redundant records dropped. The resulting silver dataset contains the aggregated variables required for merging.

In this notebook, our goal is to merge all datasets and start data preprocessing and exploratory analysis. The Area of Burn and characteristics datasets are merged on UID_Fire, REF_ID, and Date_of_Burn (renamed to Date before merging). The wildfire and weather data are merged on rounded spatial coordinates and date.
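The idea of joining on rounded coordinates can be sketched as follows. This is a minimal illustration, not the project's `get_rounded_locations` helper: the function `round_to_grid`, the 0.25 degree grid step, and all values in the toy frames are assumptions made for the example.

```python
import pandas as pd


def round_to_grid(values: pd.Series, step: float = 0.25) -> pd.Series:
    """Snap coordinates to the nearest multiple of `step` degrees (hypothetical helper)."""
    return (values / step).round() * step


# Toy fire observations with raw coordinates (made-up values).
fires = pd.DataFrame({
    "lat": [55.48, 55.27],
    "lon": [-123.71, -116.74],
    "Date": ["2010-03-11", "2010-03-19"],
})
fires["rounded_lat"] = round_to_grid(fires["lat"])
fires["rounded_lon"] = round_to_grid(fires["lon"])

# Toy weather grid cells keyed by grid coordinates and date (made-up values).
weather = pd.DataFrame({
    "rounded_lat": [55.5, 55.25],
    "rounded_lon": [-123.75, -116.75],
    "Date": ["2010-03-11", "2010-03-19"],
    "t2m_mean": [271.3, 275.9],
})

# Both sides now share exact grid values, so an equality join works.
merged = fires.merge(weather, on=["rounded_lat", "rounded_lon", "Date"], how="inner")
print(merged)
```

Joining on floating-point keys is only safe when both sides produce identical values. Multiples of 0.25 are exactly representable in binary floating point, which is why this toy example matches; other grid steps may need the keys rounded to a fixed number of decimals or converted to integers first.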

Finally, we categorize the fire data into five categories, selected from a quartile analysis of the area of burn data. Ranking wildfires by different criteria is common practice; one example is the Government of BC's [Wildfire Ranking](https://www2.gov.bc.ca/gov/content/safety/wildfire-status/about-bcws/wildfire-response/fire-characteristics/rank).

In [1]:
import os
import pandas as pd
import sys
from IPython.display import display
sys.path.insert(1, os.path.abspath(os.path.join(os.getcwd(),"..","src/data")))
from weather import get_rounded_locations

In [2]:
path_weather = os.path.abspath(os.path.join(os.getcwd(), '../data/processed/weather/silver/silver_weather-daily-mean-std.csv'))
path_aob = os.path.abspath(os.path.join(os.getcwd(), '../data/processed/wildfire/silver/silver_AoB.csv'))
path_characteristics = os.path.abspath(os.path.join(os.getcwd(), '../data/processed/wildfire/silver/silver_chracteristics.csv'))

In [3]:
df_weather = pd.read_csv(path_weather)
df_aob = pd.read_csv(path_aob)
df_characteristics = pd.read_csv(path_characteristics)

In [4]:
# Align the date column name, then index both frames by the shared fire keys
df_aob_merge = df_aob.rename(columns={'Date_of_Burn' : 'Date'}).set_index(['UID_Fire', 'REF_ID', 'Date'])
df_characteristics_merge = df_characteristics.set_index(['UID_Fire', 'REF_ID', 'Date'])
# Inner merge keeps only fire observations that have an area of burn record
df_characteristics_aob = df_characteristics_merge.merge(df_aob_merge, on=['UID_Fire', 'REF_ID', 'Date'], how='inner')
df_characteristics_aob.reset_index(drop=False, inplace=True)

In [5]:
display(df_aob.shape)
display(df_characteristics.shape)
display(df_characteristics_aob.shape)

(14891, 4)

(25442, 15)

(23798, 16)

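The inner merge of the characteristics and area of burn data yields 23,798 rows, i.e. fire observations that have a corresponding area of burn record. This exceeds the 14,891 area of burn rows because a single fire and date can have several characteristics rows (for example, one per satellite or fire status). These data will be further refined during exploratory analysis.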
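To see which observations an inner merge drops, one option is a left merge with pandas' `indicator=True`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses toy frames; the `Total_AoB` values are made up for illustration.

```python
import pandas as pd

keys = ["UID_Fire", "REF_ID", "Date"]

# Toy stand-ins for the silver datasets (values are illustrative only).
characteristics = pd.DataFrame({
    "UID_Fire": [334, 676, 677],
    "REF_ID": ["BC-2010-G60081", "AB-2010-SWF045", "AB-2010-SWF049"],
    "Date": ["2010-03-11", "2010-03-19", "2010-03-19"],
    "FRP_mean": [17.4, 26.5, 33.75],
})
aob = pd.DataFrame({
    "UID_Fire": [334, 676],
    "REF_ID": ["BC-2010-G60081", "AB-2010-SWF045"],
    "Date": ["2010-03-11", "2010-03-19"],
    "Total_AoB": [0.5, 0.7],
})

# A left merge keeps every characteristics row; `_merge` records whether it matched.
audit = characteristics.merge(aob, on=keys, how="left", indicator=True)
match_counts = audit["_merge"].value_counts()
unmatched = audit.loc[audit["_merge"] == "left_only", keys]

print(match_counts)
print(unmatched)
```

Rows flagged `left_only` are the ones an inner merge would silently discard, so listing them is a quick way to check whether the loss is expected.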

In [6]:
# Index the fire data by grid location and date so it lines up with the weather data
df_characteristics_aob_merge = df_characteristics_aob.set_index(['rounded_lat', 'rounded_lon', 'Date'])
# Rename the weather keys to match the fire data before merging
df_weather_merge = df_weather.rename(columns = {'latitude': 'rounded_lat', 'longitude': 'rounded_lon', 'date':'Date'}).set_index(['rounded_lat', 'rounded_lon', 'Date'])
df_characteristics_aob_weather = df_characteristics_aob_merge.merge(df_weather_merge, on=['rounded_lat', 'rounded_lon', 'Date'], how='inner')
df_characteristics_aob_weather.reset_index(drop=False, inplace=True)
# Restore the fire identifiers as the index of the merged dataset
df_characteristics_aob_weather.set_index(['UID_Fire', 'REF_ID', 'Date'], inplace=True)

In [7]:
df_characteristics_aob_weather.columns

Index(['rounded_lat', 'rounded_lon', 'sat', 'Status', 'T21_mean', 'T21_std',
       'T31_mean', 'T31_std', 'FRP_mean', 'FRP_std', 'conf_mean', 'conf_std',
       'Total_AoB', 't2m_mean', 't2m_std', 'cape_mean', 'cape_std', 'd2m_mean',
       'd2m_std', 'tp_mean', 'tp_std', 'tcc_mean', 'tcc_std', 'cvh_mean',
       'cvl_mean', 'swvl1_mean', 'swvl1_std', 'wind_speed_mean',
       'wind_speed_std'],
      dtype='object')

## Quartile Analysis

We couldn't find a categorization based solely on wildfire burn area in the available references and literature. Since the number of fire observations is large enough, we decided to distribute them into five categories using a quartile analysis of the Total_AoB data. The quartiles (Q1, median, Q3) separate the first four categories, and the upper extreme, Q3 + 1.5 × IQR, separates High from Very High.

In [8]:
Q1 = df_characteristics_aob_weather['Total_AoB'].quantile(0.25)
median = df_characteristics_aob_weather['Total_AoB'].quantile(0.5)
Q3 = df_characteristics_aob_weather['Total_AoB'].quantile(0.75)
# Burn area cannot be negative, so clamp the lower extreme at 0
lower_extreme = max(Q1 - 1.5 * (Q3 - Q1), 0)
upper_extreme = Q3 + 1.5 * (Q3 - Q1)

In [27]:
display(f"Q1: {Q1}", 
        f"median: {median}", 
        f"Q3: {Q3}", 
        f"lower_extreme: {lower_extreme}", 
        f"upper_extreme: {upper_extreme}")

'Q1: 0.8766504910912092'

'median: 5.785309025780867'

'Q3: 33.80179223576056'

'lower_extreme: 0'

'upper_extreme: 83.18950485276457'

In [9]:
df_characteristics_aob_weather["AoB_Category"] = ''

df_characteristics_aob_weather.loc[
    (df_characteristics_aob_weather['Total_AoB'] > lower_extreme) &
    (df_characteristics_aob_weather['Total_AoB'] <= Q1), 'AoB_Category'] = 'Very Low'

df_characteristics_aob_weather.loc[
    (df_characteristics_aob_weather['Total_AoB'] > Q1) &
    (df_characteristics_aob_weather['Total_AoB'] <= median), 'AoB_Category'] = 'Low'

df_characteristics_aob_weather.loc[
    (df_characteristics_aob_weather['Total_AoB'] > median) &
    (df_characteristics_aob_weather['Total_AoB'] <= Q3), 'AoB_Category'] = 'Moderate'

df_characteristics_aob_weather.loc[
    (df_characteristics_aob_weather['Total_AoB'] > Q3) &
    (df_characteristics_aob_weather['Total_AoB'] <= upper_extreme), 'AoB_Category'] = 'High'

df_characteristics_aob_weather.loc[
    (df_characteristics_aob_weather['Total_AoB'] > upper_extreme), 'AoB_Category'] = 'Very High'

df_characteristics_aob_weather['AoB_Category'].value_counts()

Low          5916
Very Low     5915
Moderate     5915
High         3115
Very High    2799
Name: AoB_Category, dtype: int64
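The five `.loc` assignments above amount to binning `Total_AoB` into right-closed intervals, which `pd.cut` can express in one call. The sketch below is an alternative formulation, not the code this notebook ran: the thresholds are copied from the output above and rounded, and the sample burn areas are made up.

```python
import numpy as np
import pandas as pd

# Thresholds taken from the quartile output above, rounded for readability.
lower_extreme, q1, median, q3, upper_extreme = 0.0, 0.8767, 5.7853, 33.8018, 83.1895

bins = [lower_extreme, q1, median, q3, upper_extreme, np.inf]
labels = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Made-up burn areas, one per category.
total_aob = pd.Series([0.5, 3.0, 20.0, 50.0, 500.0], name="Total_AoB")

# right=True gives intervals (lower, upper], matching the `>` / `<=` comparisons above.
categories = pd.cut(total_aob, bins=bins, labels=labels, right=True)
print(categories.tolist())
```

As with the `.loc` version, a value exactly equal to `lower_extreme` falls outside the first interval; with `pd.cut` it becomes `NaN` rather than an empty string, unless `include_lowest=True` is passed.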

In [10]:
df_characteristics_aob_weather.columns = [
    'lat', 'lon', 'satelite', 'fire_status', 't21_mean', 't21_std',
    't31_mean', 't31_std', 'frp_mean', 'frp_std', 'conf_mean', 'conf_std',
    'total_aob', 'temp_mean', 'temp_std',
    'convective_energy_mean', 'convective_energy_std',
    'dewpoint_temp_mean', 'dewpoint_temp_std',
    'total_precipitation_mean', 'total_precipitation_std',
    'total_cloud_cover_mean', 'total_cloud_cover_std',
    'high_veg_cover_mean', 'low_veg_cover_mean', 'soil_water_mean', 'soil_water_std',
    'wind_speed_mean', 'wind_speed_std', 'category_aob']
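Assigning a full list to `.columns` depends on the column order staying exactly as shown earlier. A more defensive alternative, sketched here on an empty toy frame, is to rename through an explicit old-to-new mapping. The mapping below covers only a few of the columns and is illustrative; `errors="raise"` makes pandas fail loudly if an expected column is missing.

```python
import pandas as pd

# Partial old -> new mapping, consistent with the positional rename above.
column_names = {
    "rounded_lat": "lat",
    "rounded_lon": "lon",
    "Total_AoB": "total_aob",
    "t2m_mean": "temp_mean",
    "AoB_Category": "category_aob",
}

# Empty toy frame with the original column names.
df = pd.DataFrame(columns=list(column_names))

# errors="raise" raises KeyError if any key of the mapping is not a column.
renamed = df.rename(columns=column_names, errors="raise")
print(list(renamed.columns))
```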

## Metadata

In [11]:
display(df_characteristics_aob_weather.head())
display(df_characteristics_aob_weather.shape)

| UID_Fire | REF_ID | Date | lat | lon | satelite | fire_status | t21_mean | t21_std | t31_mean | t31_std | frp_mean | frp_std | ... | total_precipitation_std | total_cloud_cover_mean | total_cloud_cover_std | high_veg_cover_mean | low_veg_cover_mean | soil_water_mean | soil_water_std | wind_speed_mean | wind_speed_std | category_aob |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 334 | BC-2010-G60081 | 2010-03-11 | 55.5 | -123.75 | A | primary | 38.55 | 0.0 | -6.75 | 0.0 | 17.4 | 0.0 | ... | 6.8e-05 | 0.968827 | 0.091779 | 0.998444 | 0.000397 | 0.324388 | 0.000648 | 4.064986 | 1.167655 | Very Low |
| 334 | BC-2010-G60081 | 2010-03-11 | 55.5 | -123.75 | A | residual | 38.55 | 0.0 | -6.75 | 0.0 | 17.4 | 0.0 | ... | 6.8e-05 | 0.968827 | 0.091779 | 0.998444 | 0.000397 | 0.324388 | 0.000648 | 4.064986 | 1.167655 | Very Low |
| 676 | AB-2010-SWF045 | 2010-03-19 | 55.25 | -116.75 | A | primary | 47.6 | 8.131728 | -0.2 | 1.202082 | 26.5 | 9.192388 | ... | 1e-06 | 0.230681 | 0.338133 | 0.929532 | 0.051989 | 0.385003 | 0.001202 | 3.08896 | 0.954334 | Very Low |
| 676 | AB-2010-SWF045 | 2010-03-19 | 55.25 | -116.75 | T | primary | 44.916667 | 10.227577 | -1.383333 | 2.458319 | 88.066667 | 30.624391 | ... | 1e-06 | 0.230681 | 0.338133 | 0.929532 | 0.051989 | 0.385003 | 0.001202 | 3.08896 | 0.954334 | Very Low |
| 677 | AB-2010-SWF049 | 2010-03-19 | 55.25 | -116.75 | A | primary | 52.7 | 15.344217 | -0.2 | 1.202082 | 33.75 | 19.445436 | ... | 1e-06 | 0.230681 | 0.338133 | 0.929532 | 0.051989 | 0.385003 | 0.001202 | 3.08896 | 0.954334 | Very Low |


(23660, 30)

In [12]:
path_gold = os.path.abspath(
        os.path.join(os.getcwd(), "../data/processed/gold_wildfire_weather_merged.csv"))
df_characteristics_aob_weather.to_csv(path_gold, index = True)
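Because the gold file is written with `index = True`, the three index levels become ordinary leading columns in the CSV. A reader of the file can restore them with `index_col`, as in this round-trip sketch; it uses a one-row toy frame and an in-memory buffer instead of the real path.

```python
import io

import pandas as pd

index_names = ["UID_Fire", "REF_ID", "Date"]

# One-row toy frame with the same index structure as the gold dataset.
df = pd.DataFrame(
    {"lat": [55.5], "lon": [-123.75], "category_aob": ["Very Low"]},
    index=pd.MultiIndex.from_tuples(
        [(334, "BC-2010-G60081", "2010-03-11")], names=index_names
    ),
)

# Write to an in-memory buffer, then read back and rebuild the MultiIndex.
buffer = io.StringIO()
df.to_csv(buffer, index=True)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=index_names)

print(restored)
```

Without further options the `Date` level is read back as strings; convert it after loading if datetime values are needed.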