<a href="https://colab.research.google.com/github/afennell-tech/USGS_Wildfires/blob/dev/ExploratoryAnalysis_AF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Note
The data for [1.88 Million US Wildfires](https://www.kaggle.com/rtatman/188-million-us-wildfires) is very large, so we store the file in Google Drive rather than in our GitHub repository. On Kaggle, the data is provided as a SQLite database containing information on US wildfires, so this project uses the sqlite3 library to read it. Feel free to download the file to your local machine if you prefer. Click [here](https://drive.google.com/drive/folders/18YlVzuPCf-IXeQQSy0F3H32oG_KHEBhr?usp=sharing) to access the folder containing all data used for this project.
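
For anyone working from a local download instead of Google Drive, the following is a minimal sketch of how the database file could be located. The helper `find_database` and the local `data` folder are hypothetical and not part of this project; the Drive path and file name are the ones used later in this notebook.

```python
from pathlib import Path

DB_NAME = "FPA_FOD_20170508.sqlite"  # file name used later in this notebook


def find_database(candidate_dirs, db_name=DB_NAME):
    """Hypothetical helper: return the first existing path to db_name, else None."""
    for directory in candidate_dirs:
        path = Path(directory) / db_name
        if path.is_file():
            return path
    return None


# Candidate locations: the Drive folder used in this notebook, then an
# assumed local ./data folder (adjust to wherever the file was downloaded).
candidates = [
    "/content/drive/MyDrive/USGS_Wildfires_Project_Content/data",
    "data",
]
db_path = find_database(candidates)
print(db_path if db_path else "Database not found; check the candidate folders.")
```

The returned path can be passed to `sqlite3.connect` in place of the Drive path.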

# Initial Setup:

Before running any of the cells below: 
1. Go to Google Drive (gdrive).
2. Find the 'USGS_Wildfires_Project_Content' folder, which should be in the 'Shared with me' section of your gdrive.
3. Right-click the folder and select 'Add shortcut to drive'.
4. Click 'Add shortcut'.
5. The folder should then appear in the 'My Drive' section of gdrive.

***Users only need to complete the above task once.*** 

# Getting Started: Workspace Setup

### Mounting Google Drive to Google Colab
Note: Whenever the runtime is reset, you will need to re-authenticate to mount gdrive.

In [2]:
from google.colab import drive
from os.path import join

ROOT = '/content/drive' # default for the drive
PROJ = 'MyDrive/USGS_Wildfires_Project_Content' # path to project on Drive
PROJ_PATH = join(ROOT, PROJ)
DATA_PATH = join(PROJ_PATH, 'data')

drive.mount(ROOT) # we mount the drive at /content/drive

# After executing the above code, the folder 'drive' will appear under
# the Files section. This is the user's own gdrive.

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# Exploratory Analysis

### Helper Functions

In [3]:
import sqlite3 # to deal with database 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api


def get_sql_connection(sql_file): 
    """Return a sqlite3 Connection object for the database file sql_file."""
    return sqlite3.connect(sql_file)


def get_table(table_name, conn):
    """Return the full table table_name from the Connection conn as a df."""
    # table_name is inserted directly into the query, so only pass trusted names
    query = "SELECT * FROM {}".format(table_name)
    return pd.read_sql_query(query, conn)


def print_col_info(df, max_out=5):
    """
    For each column of the input dataframe, print the NaN count and the number
    of distinct values (NaN counts as one value). If a column has at most
    max_out distinct non-null values, print those values as well.
    """
    # check if input is valid
    assert len(df.columns) > 0
    # iterate over each column
    for col_name, col in df.items(): 
        distinct = col.value_counts()  # non-null values only
        n_distinct = len(col.value_counts(dropna=False))  # NaN included
        if len(distinct) <= max_out: 
            print(f"""Column name: {col_name}, NaN count: {col.isna().sum()}, 
            # of distinct values: {n_distinct}, 
            distinct values: {distinct.index.tolist()}""")
        else: 
            print(f"""Column name: {col_name}, NaN count: {col.isna().sum()}, 
            # of distinct values: {n_distinct}""")



### Variable Setup

In [4]:
USGS_DATA_PATH = join(DATA_PATH, 'FPA_FOD_20170508.sqlite')
usgs_db = get_sql_connection(USGS_DATA_PATH) # USGS data
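
Before loading a table, it can help to confirm which tables the database actually contains. The sketch below is one way to do that, assuming a hypothetical helper `list_tables`; it runs against a toy in-memory database rather than the real file, and the same call should work on `usgs_db`.

```python
import sqlite3


def list_tables(conn):
    """Hypothetical helper: return the names of all tables behind conn."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Toy in-memory database standing in for the real .sqlite file
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Fires (FOD_ID INTEGER, STATE TEXT)")
print(list_tables(demo_conn))
```

`sqlite_master` is SQLite's built-in schema table, so no extra library is needed for this check.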

### Exploration

In [5]:
# Note: If the code below breaks, make sure that USGS_DATA_PATH
# is the correct path to the .sqlite file.

# load fires data
fires_df = get_table('fires', usgs_db)
# drop OBJECTID and Shape; these are specific columns for the SQL db
fires_df.drop(columns=['OBJECTID', 'Shape'], inplace=True)

In [6]:
# First, we get some info about the dataframe itself

print(f"There are {len(fires_df)} rows and {len(fires_df.columns)} columns.")

print(f"List of all column names: {fires_df.columns}")

# get more info about the columns themselves (all)
print_col_info(fires_df, max_out=20)

There are 1880465 rows and 37 columns.
List of all column names: Index(['FOD_ID', 'FPA_ID', 'SOURCE_SYSTEM_TYPE', 'SOURCE_SYSTEM',
       'NWCG_REPORTING_AGENCY', 'NWCG_REPORTING_UNIT_ID',
       'NWCG_REPORTING_UNIT_NAME', 'SOURCE_REPORTING_UNIT',
       'SOURCE_REPORTING_UNIT_NAME', 'LOCAL_FIRE_REPORT_ID',
       'LOCAL_INCIDENT_ID', 'FIRE_CODE', 'FIRE_NAME',
       'ICS_209_INCIDENT_NUMBER', 'ICS_209_NAME', 'MTBS_ID', 'MTBS_FIRE_NAME',
       'COMPLEX_NAME', 'FIRE_YEAR', 'DISCOVERY_DATE', 'DISCOVERY_DOY',
       'DISCOVERY_TIME', 'STAT_CAUSE_CODE', 'STAT_CAUSE_DESCR', 'CONT_DATE',
       'CONT_DOY', 'CONT_TIME', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'LATITUDE',
       'LONGITUDE', 'OWNER_CODE', 'OWNER_DESCR', 'STATE', 'COUNTY',
       'FIPS_CODE', 'FIPS_NAME'],
      dtype='object')
Column name: FOD_ID, NaN count: 0, 
            # of non-null values: 1880465
Column name: FPA_ID, NaN count: 0, 
            # of non-null values: 1880462
Column name: SOURCE_SYSTEM_TYPE, NaN count: 0, 
...

### For now, we only care about exploring the following variables: 
OWNER_DESCR, FIRE_SIZE, FIRE_SIZE_CLASS, FIRE_YEAR, DISCOVERY_DATE, STAT_CAUSE_DESCR, LATITUDE, LONGITUDE, STATE, COUNTY
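
Since only these variables matter for now, one option is to select them directly in SQL rather than loading every column. The following is a rough sketch under that assumption: `get_columns` is a hypothetical helper (not one of the helper functions defined earlier), and the demo table holds a single made-up row purely so the example runs on its own.

```python
import sqlite3
import pandas as pd

COLS = ['OWNER_DESCR', 'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'FIRE_YEAR',
        'DISCOVERY_DATE', 'STAT_CAUSE_DESCR', 'LATITUDE', 'LONGITUDE',
        'STATE', 'COUNTY']


def get_columns(table_name, columns, conn):
    """Hypothetical helper: load only the given columns of table_name."""
    col_list = ", ".join(f'"{c}"' for c in columns)
    query = f'SELECT {col_list} FROM "{table_name}"'
    return pd.read_sql_query(query, conn)


# Toy in-memory table with one made-up row (values are illustrative only)
demo_conn = sqlite3.connect(":memory:")
col_defs = ", ".join(f'"{c}"' for c in ['OBJECTID'] + COLS)
demo_conn.execute(f"CREATE TABLE fires ({col_defs})")
placeholders = ", ".join("?" * (len(COLS) + 1))
demo_conn.execute(
    f"INSERT INTO fires VALUES ({placeholders})",
    (1, 'USFS', 0.1, 'A', 2005, 2453403.5, 'Miscellaneous',
     40.0, -121.0, 'CA', None),
)

subset_df = get_columns('fires', COLS, demo_conn)
print(subset_df.columns.tolist())
```

Subsetting an already loaded dataframe with `fires_df[COLS]` gives the same columns; the SQL route only avoids reading the unused ones into memory.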


In [7]:
# Subset fires_df to explore the above columns: find informative descriptive
# statistics, clean the data, make simple visualizations, run simple
# regressions, etc. The goal is to begin to understand what steps to take
# moving forward.

#### We have a categorical variable that delineates fire sizes into 7 categories (FIRE_SIZE_CLASS), so let's explore how the 7 groups differ over variables such as FIRE_YEAR, STATE, and COUNTY. 

Notice that COUNTY has 678148 null values and only 3456 distinct values (NaN included), so we leave COUNTY out of the exploration for now. We may study COUNTY for specific states later on.
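
The difference between null counts, non-null counts, and distinct-value counts is easy to mix up, so here is a small illustration on a made-up COUNTY-like column (the values are invented for the example and are not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up column: three known county codes (two distinct) and three missing
county = pd.Series(['63', '63', '61', np.nan, np.nan, np.nan], name='COUNTY')

n_null = county.isna().sum()                         # rows with no value
n_non_null = county.count()                          # rows with a value
n_distinct = county.nunique()                        # distinct non-null values
n_distinct_with_nan = county.nunique(dropna=False)   # NaN counted as a value

print(f"null: {n_null}, non-null: {n_non_null}, "
      f"distinct: {n_distinct}, distinct incl. NaN: {n_distinct_with_nan}")
```

`len(col.value_counts(dropna=False))` gives the same number as `nunique(dropna=False)`, which is a count of distinct values rather than of non-null rows.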

In [8]:
# create a subset df containing the columns listed above
subs_cols = ['FIRE_SIZE_CLASS', 'FIRE_YEAR', 'STATE']
fires_size_df = fires_df[subs_cols].copy(deep=True)
print(f"Note: We have {fires_size_df.isnull().values.sum()} null values in the df")
fires_size_df

Note: We have 0 null values in the df


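| | FIRE_SIZE_CLASS | FIRE_YEAR | STATE |
|---|---|---|---|
| 0 | A | 2005 | CA |
| 1 | A | 2004 | CA |
| 2 | A | 2004 | CA |
| 3 | A | 2004 | CA |
| 4 | A | 2004 | CA |
| ... | ... | ... | ... |
| 1880460 | A | 2015 | CA |
| 1880461 | A | 2015 | CA |
| 1880462 | A | 2015 | CA |
| 1880463 | B | 2015 | CA |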


In [17]:
# Scaffold for plotting fire_size_class counts for each state
# (no plots are drawn yet; the prints mark where each plot will go)


# group by state, keeping states in order of first appearance
states_group = fires_size_df.groupby('STATE', sort=False)

# iterate over each state
for state, state_df in states_group: 
    print(state)
    year_st_group = state_df.groupby('FIRE_YEAR')
    # iterate over each year, but only use years divisible by 6 (4 plots)
    for year, year_df in year_st_group:
        # only want 4 plots
        if year % 6 == 0: 
            print()  # placeholder for the plot of this state and year


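[Output: each state abbreviation followed by blank lines, one per selected year. States in printed order: CA, NM, OR, NC, WY, CO, WA, MT, UT, AZ, SD, AR, NV, ID, MN, TX, FL, SC, LA, OK, KS, MO, NE, MI, KY, OH, IN, VA, IL, TN, GA, AK, ND, WV, WI, AL, NH, PA, MS, ME, VT, NY, IA, DC, MD, CT, MA, NJ, HI, DE, PR, RI]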
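
The loop above only prints placeholders. As a rough sketch of how the bar charts could be drawn, the example below builds one row of subplots per state, one panel per selected year. The function `plot_state`, the class list A to G, and the toy dataframe are assumptions made for illustration; on the real data, `fires_size_df` would take the place of `toy_df`.

```python
import matplotlib.pyplot as plt
import pandas as pd

SIZE_CLASSES = list("ABCDEFG")  # assumed labels for the 7 fire size classes

# Made-up rows standing in for fires_size_df
toy_df = pd.DataFrame({
    'STATE': ['CA'] * 5 + ['NM'] * 3,
    'FIRE_YEAR': [1992, 1992, 1998, 2004, 2010, 1992, 1998, 1998],
    'FIRE_SIZE_CLASS': ['A', 'B', 'A', 'C', 'A', 'B', 'B', 'G'],
})


def plot_state(state, state_df, years):
    """Hypothetical helper: one bar chart of class counts per year."""
    fig, axes = plt.subplots(1, len(years), figsize=(4 * len(years), 3),
                             sharey=True, squeeze=False)
    for ax, year in zip(axes[0], years):
        counts = (state_df.loc[state_df['FIRE_YEAR'] == year, 'FIRE_SIZE_CLASS']
                  .value_counts()
                  .reindex(SIZE_CLASSES, fill_value=0))  # keep all 7 classes
        ax.bar(counts.index, counts.values)
        ax.set_title(f"{state} {year}")
        ax.set_xlabel('FIRE_SIZE_CLASS')
    axes[0][0].set_ylabel('number of fires')
    fig.tight_layout()
    return fig


# same year filter as the loop above: years divisible by 6
years = [y for y in sorted(toy_df['FIRE_YEAR'].unique()) if y % 6 == 0]
figs = {state: plot_state(state, state_df, years)
        for state, state_df in toy_df.groupby('STATE', sort=False)}
```

With one figure per state this produces many figures on the full dataset, so closing each one with `plt.close(fig)` after displaying or saving it may be worthwhile.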


### Question to study: Does fire size class (7 classes) vary by State and/or Year?

In [9]:
# To study this question, we need to reorganize our data




Notice how CA, GA, TX, and NC have the highest numbers of recorded fires:

In [10]:
fires_size_df['STATE'].value_counts()

CA    189550
GA    168867
TX    142021
NC    111277
FL     90261
SC     81315
NY     80870
MS     79230
AZ     71586
AL     66570
OR     61088
MN     44769
OK     43239
MT     40767
NM     37478
ID     36698
CO     34157
WA     33513
WI     31861
AR     31663
TN     31154
SD     30963
UT     30725
LA     30013
KY     27089
NJ     25949
PR     22081
WV     21967
VA     21833
MO     17953
NV     16956
ND     15201
WY     14166
ME     13150
AK     12843
MI     10502
HI      9895
PA      8712
NE      7973
KS      7673
CT      4976
IA      4134
MD      3622
OH      3479
MA      2626
NH      2452
IL      2327
IN      2098
RI       480
VT       456
DE       171
DC        66
Name: STATE, dtype: int64

#### Another interesting exploration could be to visualize FIRE_SIZE over a time period such as years.
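
A minimal sketch of that idea: aggregate FIRE_SIZE per FIRE_YEAR, then plot the yearly totals. The dataframe below is made up so the example runs on its own; on the real data, `fires_df[['FIRE_YEAR', 'FIRE_SIZE']]` would be used instead. The choice of count, sum, and median as summaries is an assumption, not something fixed by this notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for fires_df[['FIRE_YEAR', 'FIRE_SIZE']]
toy_df = pd.DataFrame({
    'FIRE_YEAR': [1992, 1992, 2004, 2004, 2015],
    'FIRE_SIZE': [0.1, 10.0, 0.25, 300.0, 5.0],
})

# per-year summaries: number of fires, total size, and median size
yearly = toy_df.groupby('FIRE_YEAR')['FIRE_SIZE'].agg(['count', 'sum', 'median'])
print(yearly)

# line plot of total fire size per year
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(yearly.index, yearly['sum'], marker='o')
ax.set_xlabel('FIRE_YEAR')
ax.set_ylabel('total FIRE_SIZE')
fig.tight_layout()
```

Because fire sizes tend to be heavily skewed, looking at the median alongside the total (or using a log scale) may give a clearer picture than the total alone.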