<a href="https://colab.research.google.com/github/afennell-tech/USGS_Wildfires/blob/dev/ExploratoryAnalysis_AF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Note
The data for [1.88 Million US Wildfires](https://www.kaggle.com/rtatman/188-million-us-wildfires) is very large, so we store the file in Google Drive rather than in our GitHub repository. On Kaggle, the data is provided as a SQLite database containing information on US wildfires, so this project reads it with Python's sqlite3 library. Feel free to download the file to your local machine if you prefer. Click [here](https://drive.google.com/drive/folders/18YlVzuPCf-IXeQQSy0F3H32oG_KHEBhr?usp=sharing) to access the folder containing all data used for this project.
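
If you work from a local download instead of Google Drive, the data directory can be picked at run time. The sketch below is one possible way to do that. The helper names `in_colab` and `resolve_data_path` and the default local folder are assumptions made for illustration and are not part of this project. The Drive path matches the one used when mounting Drive in Colab.

```python
import importlib.util
from os.path import expanduser, join


def in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ModuleNotFoundError:
        # raised when the parent package 'google' is not installed at all
        return False


def resolve_data_path(colab=None,
                      local_dir="~/USGS_Wildfires_Project_Content/data"):
    """Hypothetical helper: choose the data folder for the current environment.

    local_dir is an assumed location for a manual download; change it to
    wherever you saved the files.
    """
    if colab is None:
        colab = in_colab()
    if colab:
        return join("/content/drive",
                    "MyDrive/USGS_Wildfires_Project_Content", "data")
    return expanduser(local_dir)


print(resolve_data_path())
```

On a local machine you would still skip the `drive.mount` call; only the path logic is shared between the two environments.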

# Initial Setup:

Before running any of the cells below:
1. Go to Google Drive (gdrive).
2. Find the 'USGS_Wildfires_Project_Content' folder, which should be in the 'Shared with me' section of your gdrive.
3. Right-click the folder and select 'Add shortcut to drive'.
4. Click 'Add shortcut'.
5. The folder should then appear in the 'My Drive' section of gdrive.

***Users only need to complete the above task once.*** 

# Getting Started: Workspace Setup

### Mounting Google Drive to Google Colab
Note: Any time the runtime is reset, you will need to reauthenticate to mount gdrive

In [87]:
from google.colab import drive
from os.path import join

ROOT = '/content/drive' # default for the drive
PROJ = 'MyDrive/USGS_Wildfires_Project_Content' # path to project on Drive
PROJ_PATH = join(ROOT, PROJ)
DATA_PATH = join(PROJ_PATH, 'data')

drive.mount(ROOT) # we mount the drive at /content/drive

# After executing the above code, the folder 'drive' will appear under
# the files section. This is the user's own gdrive.

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# Exploratory Analysis

### Helper Functions

In [3]:
import sqlite3 # to deal with database 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api


def get_sql_connection(sql_file):
    """Returns a Connection object that represents the input db."""
    return sqlite3.connect(sql_file)


def get_table(table_name, conn):
    """Returns a df for table from Connection object."""
    # table_name is interpolated into the query, so only pass trusted names
    query = "SELECT * FROM {}".format(table_name)
    return pd.read_sql_query(query, conn)


def print_col_info(df, max_out=5):
    """
    For each column of the input dataframe, prints the NaN count and the
    number of distinct values the column takes on (NaN counts as a value).
    If the number of distinct non-null values is at most max_out, those
    values are printed as well.
    """
    # check if input is valid
    assert len(df.columns) > 0
    # iterate over each column
    for col_name, col in df.items():
        counts = col.value_counts()  # distinct non-null values
        n_distinct = len(col.value_counts(dropna=False))
        msg = (f"Column name: {col_name}, NaN count: {col.isna().sum()}, "
               f"# of distinct values (incl. NaN): {n_distinct}")
        if len(counts) <= max_out:
            msg += f", distinct values: {counts.index.tolist()}"
        print(msg)



### Time Series Data Wrangling

In [95]:
def get_date_range(in_df, year_col='FIRE_YEAR', doy_col='DISCOVERY_DOY'):
    """ Fetches date range of data set

    Args:
        in_df: dataframe for data set
        year_col: string denoting the column name that holds year data for in_df
        doy_col: string denoting the column name that holds (discovered) doy
        data for in_df

    Returns:
        A daily DatetimeIndex running from the earliest to the latest date
        in in_df
    """
    # get start and end year (min/max, so the result does not depend on row order)
    start_yr = in_df[year_col].min()
    end_yr = in_df[year_col].max()

    # get data for start and end doy
    start_doy = in_df.loc[in_df[year_col] == start_yr, doy_col].min()
    end_doy = in_df.loc[in_df[year_col] == end_yr, doy_col].max()

    # build output data; %j expects a zero-padded, 3-digit day of year
    start_date = pd.to_datetime(f"{int(start_yr)}{int(start_doy):03d}", format='%Y%j')
    end_date = pd.to_datetime(f"{int(end_yr)}{int(end_doy):03d}", format='%Y%j')

    return pd.date_range(start=start_date, end=end_date)

# Example usage of the above function
# complete_date_rng = get_date_range(fires_df)
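
A complete date range is mostly useful for filling in days that have no recorded fire. The sketch below illustrates that idea on a small made-up dataframe; `toy_df` and its values are assumptions for illustration only, and the date range is built directly with `pd.date_range` so the block runs on its own.

```python
import pandas as pd

# Toy stand-in for fires_df (values are made up for illustration)
toy_df = pd.DataFrame({
    "FIRE_YEAR": [2004, 2004, 2005, 2005],
    "DISCOVERY_DOY": [360, 362, 2, 2],
})

# Build a proper date from year + day of year; %j expects a 3-digit day
date_strings = (toy_df["FIRE_YEAR"].astype(str)
                + toy_df["DISCOVERY_DOY"].astype(str).str.zfill(3))
toy_df["DATE_DISCOVERED"] = pd.to_datetime(date_strings, format="%Y%j")

# Complete daily index from the first to the last discovery date
complete_date_rng = pd.date_range(toy_df["DATE_DISCOVERED"].min(),
                                  toy_df["DATE_DISCOVERED"].max())

# Fires per day, with 0 filled in for days that have no recorded fire
daily_counts = (toy_df["DATE_DISCOVERED"].value_counts()
                .reindex(complete_date_rng, fill_value=0))
print(daily_counts)
```

Reindexing against the full range keeps zero-fire days in the series, which matters for any later time series analysis that assumes evenly spaced observations.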

"""
TODO 

Args: 
    TODO
Returns: 
    TODO
"""
def function_name(): 
    pass

### Variable Setup

In [96]:
USGS_DATA_PATH = join(DATA_PATH, 'FPA_FOD_20170508.sqlite')
usgs_db = get_sql_connection(USGS_DATA_PATH) # USGS data

### Exploration

Load data

In [94]:
# load fires data
fires_df = get_table('fires', usgs_db)
# drop OBJECTID and Shape; these are specific columns for the SQL db
fires_df.drop(columns=['OBJECTID', 'Shape'], inplace=True)

In [None]:
# First, we get some info about the dataframe itself

print(f"There are {len(fires_df)} rows and {len(fires_df.columns)} columns.")

print(f"List of all column names: {fires_df.columns}")

# get more info about the columns themselves (all)
print_col_info(fires_df)

Add properly formatted date columns for the day each fire was discovered and the day it was contained. Fires with no recorded containment day get a null date (NaT).

In [98]:
# add columns that contain proper dates for each fire
fires_df['DATE_DISCOVERED'] = pd.to_datetime(fires_df['FIRE_YEAR']*1000 + fires_df['DISCOVERY_DOY'], format='%Y%j')
fires_df['DATE_CONTAINED'] = pd.to_datetime(fires_df['FIRE_YEAR']*1000 + fires_df['CONT_DOY'], format='%Y%j')
fires_df

Unnamed: 0,FOD_ID,FPA_ID,SOURCE_SYSTEM_TYPE,SOURCE_SYSTEM,NWCG_REPORTING_AGENCY,NWCG_REPORTING_UNIT_ID,NWCG_REPORTING_UNIT_NAME,SOURCE_REPORTING_UNIT,SOURCE_REPORTING_UNIT_NAME,LOCAL_FIRE_REPORT_ID,LOCAL_INCIDENT_ID,FIRE_CODE,FIRE_NAME,ICS_209_INCIDENT_NUMBER,ICS_209_NAME,MTBS_ID,MTBS_FIRE_NAME,COMPLEX_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE,COUNTY,FIPS_CODE,FIPS_NAME,DATE_DISCOVERED,DATE_CONTAINED
0,1,FS-1418826,FED,FS-FIRESTAT,FS,USCAPNF,Plumas National Forest,0511,Plumas National Forest,1,PNF-47,BJ8K,FOUNTAIN,,,,,,2005,2453403.5,33,1300,9.0,Miscellaneous,2453403.5,33.0,1730,0.10,A,40.036944,-121.005833,5.0,USFS,CA,63,063,Plumas,2005-02-02,2005-02-02
1,2,FS-1418827,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,0503,Eldorado National Forest,13,13,AAC0,PIGEON,,,,,,2004,2453137.5,133,0845,1.0,Lightning,2453137.5,133.0,1530,0.25,A,38.933056,-120.404444,5.0,USFS,CA,61,061,Placer,2004-05-12,2004-05-12
2,3,FS-1418835,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,0503,Eldorado National Forest,27,021,A32W,SLACK,,,,,,2004,2453156.5,152,1921,5.0,Debris Burning,2453156.5,152.0,2024,0.10,A,38.984167,-120.735556,13.0,STATE OR PRIVATE,CA,17,017,El Dorado,2004-05-31,2004-05-31
3,4,FS-1418845,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,0503,Eldorado National Forest,43,6,,DEER,,,,,,2004,2453184.5,180,1600,1.0,Lightning,2453189.5,185.0,1400,0.10,A,38.559167,-119.913333,5.0,USFS,CA,3,003,Alpine,2004-06-28,2004-07-03
4,5,FS-1418847,FED,FS-FIRESTAT,FS,USCAENF,Eldorado National Forest,0503,Eldorado National Forest,44,7,,STEVENOT,,,,,,2004,2453184.5,180,1600,1.0,Lightning,2453189.5,185.0,1200,0.10,A,38.559167,-119.933056,5.0,USFS,CA,3,003,Alpine,2004-06-28,2004-07-03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1880460,300348363,2015CAIRS29019636,NONFED,ST-CACDF,ST/C&L,USCASHU,Shasta-Trinity Unit,CASHU,Shasta-Trinity Unit,591814,009371,,ODESSA 2,,,,,,2015,2457291.5,269,1726,13.0,Missing/Undefined,2457291.5,269.0,1843,0.01,A,40.481637,-122.389375,13.0,STATE OR PRIVATE,CA,,,,2015-09-26,2015-09-26
1880461,300348373,2015CAIRS29217935,NONFED,ST-CACDF,ST/C&L,USCATCU,Tuolumne-Calaveras Unit,CATCU,Tuolumne-Calaveras Unit,569419,000366,,,,,,,,2015,2457300.5,278,0126,9.0,Miscellaneous,,,,0.20,A,37.617619,-120.938570,12.0,MUNICIPAL/LOCAL,CA,,,,2015-10-05,NaT
1880462,300348375,2015CAIRS28364460,NONFED,ST-CACDF,ST/C&L,USCATCU,Tuolumne-Calaveras Unit,CATCU,Tuolumne-Calaveras Unit,574245,000158,,,,,,,,2015,2457144.5,122,2052,13.0,Missing/Undefined,,,,0.10,A,37.617619,-120.938570,12.0,MUNICIPAL/LOCAL,CA,,,,2015-05-02,NaT
1880463,300348377,2015CAIRS29218079,NONFED,ST-CACDF,ST/C&L,USCATCU,Tuolumne-Calaveras Unit,CATCU,Tuolumne-Calaveras Unit,570462,000380,,,,,,,,2015,2457309.5,287,2309,13.0,Missing/Undefined,,,,2.00,B,37.672235,-120.898356,12.0,MUNICIPAL/LOCAL,CA,,,,2015-10-14,NaT
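
The DISCOVERY_DATE and CONT_DATE columns in the preview above appear to hold Julian day numbers. As a cross-check on the year and day-of-year conversion, the sketch below converts a few of those values with `pd.to_datetime` and `origin="julian"`. The `sample` dataframe is a hand-copied excerpt from the preview, not the real table, and missing containment dates are left out of it.

```python
import pandas as pd

# Values copied by hand from rows 0, 1 and 3 of the preview above
sample = pd.DataFrame({
    "DISCOVERY_DATE": [2453403.5, 2453137.5, 2453184.5],
    "CONT_DATE": [2453403.5, 2453137.5, 2453189.5],
})

# origin="julian" interprets the numbers as Julian days (unit must be "D")
discovered = pd.to_datetime(sample["DISCOVERY_DATE"], unit="D", origin="julian")
contained = pd.to_datetime(sample["CONT_DATE"], unit="D", origin="julian")

# Days between discovery and containment
days_to_contain = (contained - discovered).dt.days

print(pd.DataFrame({"discovered": discovered,
                    "contained": contained,
                    "days_to_contain": days_to_contain}))
```

The resulting dates agree with the DATE_DISCOVERED and DATE_CONTAINED values shown for the same rows, which suggests either route can be used.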


### For now, we only care about exploring the following variables: 
OWNER_DESCR, FIRE_SIZE, FIRE_SIZE_CLASS, FIRE_YEAR, DISCOVERY_DATE, STAT_CAUSE_DESCR, LATITUDE, LONGITUDE, STATE, COUNTY
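
Since only these variables are needed for now, one option is to ask SQLite for just those columns instead of loading the whole table. The sketch below shows the idea against a tiny in-memory database; the helper name `get_columns` and the one-row toy table are assumptions for illustration, and the real call would pass the project's own connection.

```python
import sqlite3
import pandas as pd

COLS = ['OWNER_DESCR', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'FIRE_YEAR',
        'DISCOVERY_DATE', 'STAT_CAUSE_DESCR', 'LATITUDE', 'LONGITUDE',
        'STATE', 'COUNTY']


def get_columns(table_name, columns, conn):
    """Hypothetical helper: read only the named columns of a table."""
    # identifiers are interpolated, so only pass trusted names
    col_sql = ", ".join(f'"{c}"' for c in columns)
    return pd.read_sql_query(f'SELECT {col_sql} FROM "{table_name}"', conn)


# Tiny in-memory table standing in for the real database (one copied row)
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({
    'FOD_ID': [1], 'OWNER_DESCR': ['USFS'], 'FIRE_SIZE': [0.10],
    'FIRE_SIZE_CLASS': ['A'], 'FIRE_YEAR': [2005],
    'DISCOVERY_DATE': [2453403.5], 'STAT_CAUSE_DESCR': ['Miscellaneous'],
    'LATITUDE': [40.036944], 'LONGITUDE': [-121.005833],
    'STATE': ['CA'], 'COUNTY': ['63'],
}).to_sql('fires', demo_conn, index=False)

subset_df = get_columns('fires', COLS, demo_conn)
print(subset_df.columns.tolist())
```

With the real database this would look like `get_columns('fires', COLS, usgs_db)`, which should reduce memory use compared with `SELECT *` on a table of this size.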


In [7]:
"""
Subset fires_df to explore the above columns. Find necessary and informative
descriptive statistics, clean the data, make simple visualizations, run simple
regressions, etc. Just do whatever feels right so we can begin to understand 
what steps to take moving forward. 
"""


#### We have a categorical variable that delineates fire sizes into 7 categories (FIRE_SIZE_CLASS), so let's explore how the 7 groups differ over variables such as FIRE_YEAR, STATE, and COUNTY. 

Notice that COUNTY is null in 678148 rows and takes on 3456 distinct values according to print_col_info, so leave COUNTY out of the exploration for now. We may study COUNTY for specific states later on.
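
To compare missingness across all columns at once, null counts and distinct-value counts can be collected into one table. The sketch below is a minimal example on made-up data; `toy_df` and `col_summary` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Toy stand-in for fires_df (made-up values)
toy_df = pd.DataFrame({
    "STATE": ["CA", "CA", "GA", "TX"],
    "COUNTY": ["63", None, None, "17"],
})

col_summary = pd.DataFrame({
    "n_null": toy_df.isna().sum(),       # rows where the column is missing
    "n_distinct": toy_df.nunique(),      # distinct non-null values
    "null_frac": toy_df.isna().mean(),   # share of rows that are missing
}).sort_values("null_frac", ascending=False)

print(col_summary)
```

Sorting by `null_frac` puts the columns with the most missing data at the top, which makes it easier to decide which ones to set aside.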

In [8]:
# create a subset df containing the columns listed above
subs_cols = ['FIRE_SIZE_CLASS', 'FIRE_YEAR', 'STATE']
fires_size_df = fires_df[subs_cols].copy(deep=True)
print(f"Note: We have {fires_size_df.isnull().values.sum()} null values in the df")
fires_size_df

Note: We have 0 null values in the df


Unnamed: 0,FIRE_SIZE_CLASS,FIRE_YEAR,STATE
0,A,2005,CA
1,A,2004,CA
2,A,2004,CA
3,A,2004,CA
4,A,2004,CA
...,...,...,...
1880460,A,2015,CA
1880461,A,2015,CA
1880462,A,2015,CA
1880463,B,2015,CA


In [None]:
# Plot fire_size_class counts for each state


# groupby state
states_group = fires_size_df.groupby('STATE', sort=False)

# iterate over each state
for state, state_df in states_group: 
    print(state)
    year_st_group = state_df.groupby('FIRE_YEAR')
    # iterate over each year, but only use increments of 6 years (4 plots)
    for year, year_df in year_st_group:
        # only want 4 plots
        if year % 6 == 0: 
            print()
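
The loop above is still a stub. One possible way to get the per-state, per-year class counts without nested loops is a single groupby followed by `unstack`, sketched here on made-up data. The toy dataframe is an assumption for illustration; the same calls should apply to `fires_size_df`.

```python
import pandas as pd

# Toy stand-in for fires_size_df (made-up values)
toy_df = pd.DataFrame({
    "FIRE_SIZE_CLASS": ["A", "A", "B", "A", "C", "B"],
    "FIRE_YEAR": [1992, 1992, 1992, 1998, 1998, 1993],
    "STATE": ["CA", "CA", "CA", "GA", "GA", "CA"],
})

# Keep every sixth year, mirroring the `year % 6 == 0` check in the loop
sampled = toy_df[toy_df["FIRE_YEAR"] % 6 == 0]

# One row per (state, year), one column per fire size class
class_counts = (sampled
                .groupby(["STATE", "FIRE_YEAR", "FIRE_SIZE_CLASS"])
                .size()
                .unstack("FIRE_SIZE_CLASS", fill_value=0))

print(class_counts)
```

A bar chart for one state could then be drawn from a slice such as `class_counts.loc["CA"]` using the dataframe's `plot(kind="bar")` method.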


### Question to study: Does fire size class (7 classes) vary by State and/or Year?

In [None]:
# To study this question, we need to reorganize our data
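
One possible reorganization is a contingency table of state by fire size class, normalized so that each row sums to 1. That makes states with very different fire totals comparable. The sketch below uses made-up data; `toy_df` and `class_shares` are illustrative names only.

```python
import pandas as pd

# Toy stand-in for fires_size_df (made-up values)
toy_df = pd.DataFrame({
    "STATE": ["CA", "CA", "CA", "CA", "GA", "GA", "GA", "GA"],
    "FIRE_SIZE_CLASS": ["A", "A", "B", "B", "A", "C", "C", "C"],
})

# Share of each size class within each state (rows sum to 1)
class_shares = pd.crosstab(toy_df["STATE"], toy_df["FIRE_SIZE_CLASS"],
                           normalize="index")

print(class_shares)
```

Swapping `toy_df["STATE"]` for a FIRE_YEAR column would give the same view by year.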

Notice that CA, GA, TX, and NC each have a high number of recorded fires.

In [None]:
fires_size_df['STATE'].value_counts()

#### Another interesting exploration could be to visualize FIRE_SIZE over a time period such as years.
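
As a starting point for that exploration, FIRE_SIZE can be aggregated per year before plotting. The sketch below is a minimal example; the toy values are copied by hand from the FIRE_YEAR and FIRE_SIZE columns of the preview rows shown earlier, so the numbers say nothing about the full data set.

```python
import pandas as pd

# Hand-copied excerpt of FIRE_YEAR / FIRE_SIZE from the preview rows
toy_df = pd.DataFrame({
    "FIRE_YEAR": [2004, 2004, 2004, 2004, 2005, 2015, 2015, 2015, 2015],
    "FIRE_SIZE": [0.25, 0.10, 0.10, 0.10, 0.10, 0.01, 0.20, 0.10, 2.00],
})

# Number of fires, total size and median size for each year
yearly = toy_df.groupby("FIRE_YEAR")["FIRE_SIZE"].agg(["count", "sum", "median"])

print(yearly)
```

Each column of `yearly` can then be passed to a line plot. The median is worth plotting alongside the total, since a handful of very large fires can dominate a yearly sum.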