# Silver Dataset Generation using Wildfire Spatial Data
<font size=3><strong>Author:</strong> <a href="https://www.linkedin.com/in/~ashkan/" target="_blank">Ashkan Soltanieh</a><br>
<strong>Date:</strong>  Jan. 8, 2022</font>

## Table of Contents

<div class="alert alert-success mt-20">
    <ul>
        <li><a href="#Approach">Approach</a></li>
        <li><a href="#Area of Burn Data">Area of Burn Data</a></li>
        <li><a href="#Characteristics Data">Characteristics Data</a></li>
        <li><a href="#Temperature Standardization">Temperature Standardization</a></li>
        <li><a href="#Metadata">Metadata</a></li>
    </ul>
</div>

## Approach:
The raw, uncleaned (bronze) dataset is refined further in this notebook. The wildfire data, stored in two separate CSV files, are merged and aggregated so that each row represents an individual fire event.

Latitude and longitude values are rounded to the nearest quarter degree to match the scraped weather data, so part of cleaning the characteristics data is computing the <code>rounded_lat</code> and <code>rounded_lon</code> columns.
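
The rounding step can be sketched as follows. This is a minimal illustration, not the code in <code>src/data/wildfire.py</code>: the helper name `round_to_quarter` and the toy coordinates are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def round_to_quarter(values):
    """Round coordinates to the nearest 0.25 degree (hypothetical helper)."""
    # Multiplying by 4 maps quarter-degree steps onto whole numbers,
    # so ordinary rounding followed by division by 4 snaps to the grid.
    return np.round(np.asarray(values, dtype=float) * 4) / 4


# Toy detections; the values are illustrative only.
points = pd.DataFrame({"lat": [53.829, 53.316], "lon": [-124.332, -123.856]})
points["rounded_lat"] = round_to_quarter(points["lat"])
points["rounded_lon"] = round_to_quarter(points["lon"])
print(points)
```

Rounding both datasets to the same 0.25 degree grid gives the fire and weather records a shared key to join on.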

## Area of Burn Data
The area of burn (AoB) data are stored as Polygon objects, and AoB has been selected as the label. Each Polygon object has an <code>area</code> property, which is used to obtain the burn area. No single column uniquely identifies a row of the AoB data, so the area is aggregated by UID_Fire, date of burn, and REF_ID (unique identifier). The unused variables listed here are dropped from the dataset:<br>

**Dropped Columns:**<br>
> **FD_Agency:** Redundant, as all data in the current dataset are from Canada<br>
> **JD, date_src, Year:** Date-related data are covered in the characteristics dataset. Only Map_Date is kept.<br>

A single UID_Fire and REF_ID pair can have multiple area recordings on the same day; therefore, the areas are summed to give the total burn area for each fire on each day.
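
A minimal sketch of this aggregation is shown here. It assumes the polygons are stored as WKT strings in a column named `geometry` (an assumption; the bronze file's actual column name is not shown in this notebook), and the rows and REF_ID values are made up for illustration.

```python
import pandas as pd
from shapely.wkt import loads

# Toy AoB records: two polygons for the same fire and day, one for another fire.
raw = pd.DataFrame({
    "UID_Fire": ["100", "100", "1000"],
    "REF_ID": ["BC-0001", "BC-0001", "AB-0002"],  # hypothetical identifiers
    "Map_Date": ["2011-05-20", "2011-05-20", "2014-06-29"],
    "geometry": [  # assumed column name holding WKT text
        "POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))",
        "POLYGON ((0 0, 2 0, 2 1, 0 1, 0 0))",
        "POLYGON ((0 0, 3 0, 3 2, 0 2, 0 0))",
    ],
})

# Parse each WKT string and read the polygon's area property.
raw["area"] = raw["geometry"].apply(lambda wkt: loads(wkt).area)

# Sum the areas so each (fire, reference, day) combination appears once.
total = (
    raw.groupby(["UID_Fire", "REF_ID", "Map_Date"])["area"]
    .sum()
    .rename("Total_AoB")
    .to_frame()
)
print(total)
```

The resulting area is expressed in the squared units of the polygon coordinates, so any conversion to physical units would depend on the coordinate reference system of the source data.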

## Characteristics Data
Most of the variables in this dataset will be selected as features. In this notebook, the variables with missing or redundant information are dropped. Below is the list of the dropped variables and the reason behind making this decision:

**Dropped Columns:**<br>
> **FD_Agency:** Redundant, as all data in the current dataset are from Canada<br>
> **dn:** This variable is missing for observations before 2016. It is dropped for consistency across all observations<br>
> **HHMM:** The time variable is not used as an index, because the fire data are aggregated daily like the weather data<br>
> **sample:** Other identifiers are used instead of this variable<br>
> **type:** Redundant for the Alberta and British Columbia dataset, as only type zero (presumed vegetation fire) exists in the table<br>
> **geometry:** The lat/lng columns already hold the EPSG 4326 representation of the point.<br>
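
Dropping the columns listed above could look like the sketch below. The toy frame and its values are assumptions for illustration; only the column names come from the list.

```python
import pandas as pd

# Columns named in the "Dropped Columns" list for the characteristics data.
DROPPED = ["FD_Agency", "dn", "HHMM", "sample", "type", "geometry"]

# Toy row with made-up values, just to show the effect of the drop.
toy = pd.DataFrame({
    "UID_Fire": [313],
    "FD_Agency": ["BC"],
    "HHMM": [1930],
    "type": [0],
    "FRP": [110.7],
})

# errors="ignore" skips listed columns that are absent from this frame.
cleaned = toy.drop(columns=DROPPED, errors="ignore")
print(cleaned.columns.tolist())
```

Using `errors="ignore"` keeps the same drop list usable on frames that only contain some of these columns.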

## Temperature Standardization
The T21 and T31 brightness temperatures of the fire data are in Kelvin. For easier interpretation, their values are replaced by the Celsius equivalents. This conversion is implemented in the <code>src/data/wildfire.py</code> module.
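
For reference, the conversion is a fixed offset of 273.15. The snippet is a sketch under that standard definition; the function name `kelvin_to_celsius` is hypothetical and is not taken from <code>src/data/wildfire.py</code>.

```python
import pandas as pd

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in Kelvin


def kelvin_to_celsius(df, columns=("T21", "T31")):
    """Return a copy of df with the given columns converted from K to Celsius."""
    out = df.copy()
    for col in columns:
        out[col] = out[col] - KELVIN_OFFSET
    return out


# Toy brightness temperatures in Kelvin (illustrative values).
readings = pd.DataFrame({"T21": [342.9, 317.5], "T31": [275.6, 273.3]})
celsius = kelvin_to_celsius(readings)
print(celsius.round(2))
```

Working on a copy leaves the input frame in Kelvin, which makes it harder to apply the offset twice by accident.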

In [1]:
import os
import numpy as np
import pandas as pd
from shapely.wkt import loads
import sys
from IPython.display import display

In [2]:
# paths to the bronze wildfire datasets
path_aob_bronze = os.path.abspath(os.path.join(os.getcwd(), '../data/processed/wildfire/bronze/bronze_AoB.csv'))
path_characteristics_bronze = os.path.abspath(os.path.join(os.getcwd(), '../data/processed/wildfire/bronze/bronze_chracteristics.csv'))

## Decision making on wildfire data aggregation type
### Area of burn

Both the area of burn data and the characteristics data contain duplicates in their suggested indices. According to the MODIS documentation, each fire can be uniquely identified by <code>['UID_Fire', 'REF_ID', 'Map_Date']</code>. However, we observe over 425k duplicated rows under this criterion.

In [3]:
df_aob = pd.read_csv(path_aob_bronze, dtype={'UID_Fire': str}).set_index(['UID_Fire', 'REF_ID', 'Map_Date'])
display('Duplicated area of burn records: ', np.unique(df_aob.index.duplicated(), return_counts=True))

'Duplicated area of burn records: '

(array([False,  True]), array([ 14891, 425632]))

This indicates that a single fire can produce multiple burn areas. To aggregate these data accurately, areas are summed within groups defined by the provided unique identifiers for area of burn.
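
To read the counts above correctly, it helps to recall that `Index.duplicated()` uses `keep="first"` by default, so the first occurrence of each key is marked `False` and every later repeat is marked `True`. The toy example below (made-up rows) illustrates this; the number of `False` flags equals the number of distinct keys, which is also the number of rows left after aggregation.

```python
import numpy as np
import pandas as pd

# Toy AoB rows: key ("1", "A", "2011-05-20") appears three times.
toy = pd.DataFrame({
    "UID_Fire": ["1", "1", "1", "2"],
    "REF_ID": ["A", "A", "A", "B"],
    "Map_Date": ["2011-05-20", "2011-05-20", "2011-05-20", "2011-05-21"],
    "area": [0.1, 0.2, 0.3, 0.4],
}).set_index(["UID_Fire", "REF_ID", "Map_Date"])

# First occurrences are False; repeats of an already-seen key are True.
flags = toy.index.duplicated()
values, counts = np.unique(flags, return_counts=True)
print(values, counts)

# The False count matches the number of distinct index keys.
n_unique_keys = int((~flags).sum())
print(n_unique_keys, len(toy.index.unique()))
```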

### Characteristics

In [4]:
df_characteristics = pd.read_csv(path_characteristics_bronze, low_memory = False).set_index(['UID_Fire', 'REF_ID', 'YYYYMMDD'])
display('Duplicated characteristics records: ', np.unique(df_characteristics.index.duplicated(), return_counts=True))

'Duplicated characteristics records: '

(array([False,  True]), array([  8268, 155972]))

As with the area of burn data, the characteristics data show around 156k records with duplicated indices. A quick look at the characteristics data shows that they can additionally be categorized by satellite, fire status, and latitude/longitude location. To make the aggregation more efficient, latitude and longitude are also rounded to the nearest quarter degree, which matches the weather data records and makes the rows easier to link to the area of burn data.

Continuous variables in the characteristics data are aggregated by mean and standard deviation, grouped by the following categorical variables: <code>['Date', 'sat', 'UID_Fire', 'Status', 'REF_ID', 'rounded_lat', 'rounded_lon']</code>.
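
The grouping can be sketched with a pandas `groupby` as shown here. This is an illustration under assumptions, not the contents of `make_silver_dataframes`: the toy rows are made up, and filling the undefined standard deviation of single-row groups with 0 is an assumption suggested by the zeros in the silver output.

```python
import pandas as pd

KEYS = ["Date", "sat", "UID_Fire", "Status", "REF_ID", "rounded_lat", "rounded_lon"]

# Toy detections: two on the same day and grid cell, one on a later day.
toy = pd.DataFrame({
    "Date": ["2010-01-13", "2010-01-13", "2010-01-18"],
    "sat": ["A", "A", "A"],
    "UID_Fire": [313, 313, 313],
    "Status": ["removed"] * 3,
    "REF_ID": ["BC-2010-G40151"] * 3,
    "rounded_lat": [53.75] * 3,
    "rounded_lon": [-124.25] * 3,
    "FRP": [50.7, 76.9, 25.8],
    "conf": [79.0, 84.0, 63.0],
})

# Mean and sample standard deviation (ddof=1) for every continuous column.
agg = toy.groupby(KEYS).agg(["mean", "std"])

# Flatten the (variable, statistic) column MultiIndex to names like "FRP_mean".
agg.columns = [f"{var}_{stat}" for var, stat in agg.columns]

# Assumption: a group with a single row has an undefined std (NaN), set to 0.
agg = agg.fillna(0.0)
print(agg)
```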

In [6]:
# create the silver dataframe using custom script
sys.path.insert(1, os.path.abspath(os.path.join(os.getcwd(),"..","src/data")))
from wildfire import make_silver_dataframes

df_aob, df_characteristics = make_silver_dataframes(path_aob_bronze, path_characteristics_bronze)

In [9]:
# save extracted cleaned dataframe into silver datasets
path_aob = os.path.abspath(
        os.path.join(os.getcwd(), "../data/processed/wildfire/silver/silver_AoB.csv"))
path_characteristics = os.path.abspath(
        os.path.join(os.getcwd(), "../data/processed/wildfire/silver/silver_chracteristics.csv"))

df_aob.to_csv(path_aob, index = True)
df_characteristics.to_csv(path_characteristics, index = True)

## Metadata

In [10]:
display(df_aob.head())
display(df_aob.shape)
display(df_characteristics.head())
display(df_characteristics.shape)

| UID_Fire | REF_ID | Date_of_Burn | Total_AoB |
|---|---|---|---|
| 100 | BC-2011-V30040 | 2011-05-20 | 0.021933 |
| 100 | BC-2014-G80090 | 2014-05-31 | 0.107307 |
| 1000 | AB-2014-HWF124 | 2014-06-29 | 0.255955 |
| 1000 | AB-2015-SWF061 | 2015-05-22 | 0.022832 |
| 1000 | AB-2016-EWF008 | 2016-04-08 | 0.01035 |


(14891, 1)

| Date | sat | UID_Fire | Status | REF_ID | rounded_lat | rounded_lon | lat_mean | lat_std | lon_mean | lon_std | T21_mean | T21_std | T31_mean | T31_std | FRP_mean | FRP_std | conf_mean | conf_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010-01-12 | A | 313 | removed | BC-2010-G40151 | 53.75 | -124.25 | 53.829 | 0.0 | -124.332 | 0.0 | 69.75 | 0.0 | 2.45 | 0.0 | 110.7 | 0.0 | 93.0 | 0.0 |
| 2010-01-12 | T | 313 | removed | BC-2010-G40151 | 53.75 | -124.25 | 53.832 | 0.0 | -124.335 | 0.0 | 44.35 | 0.0 | 0.15 | 0.0 | 82.7 | 0.0 | 64.0 | 0.0 |
| 2010-01-13 | A | 313 | removed | BC-2010-G40151 | 53.75 | -124.25 | 53.838 | 0.001414 | -124.3265 | 0.010607 | 71.9 | 8.980256 | -3.55 | 0.424264 | 63.8 | 18.526198 | 81.5 | 3.535534 |
| 2010-01-18 | A | 313 | removed | BC-2010-G40151 | 53.75 | -124.25 | 53.845 | 0.0 | -124.302 | 0.0 | 37.45 | 0.0 | -2.75 | 0.0 | 25.8 | 0.0 | 63.0 | 0.0 |
| 2010-01-22 | A | 211 | removed | BC-2010-C10299 | 53.25 | -123.75 | 53.316 | 0.0 | -123.856 | 0.0 | 72.55 | 0.0 | -3.45 | 0.0 | 60.5 | 0.0 | 84.0 | 0.0 |


(25442, 12)

## <h3 align="center"> Copyright © 2022 - All rights reserved by the author.</h3>