# Learning the Cause of Wildfires
### by Cater Portwood and Christine Miller
#### STA 208 Final Project, Spring 2019

## Introduction
Wildfires are an increasingly common threat to human lives, infrastructure, and wildlife. While occasional wildfires are often a natural part of an ecosystem, the majority of wildfires in the US are human-caused. Determining how fires start is important for suggesting future preventive measures and for bringing those responsible to justice. Currently, firefighters, law enforcement, and land managers use a variety of clues, including burn patterns, location, and local weather, to ascertain the cause of a fire.

Our question is whether machine learning can be used to predict the cause of a fire from its characteristics. In particular, can we create a model that accurately predicts whether or not a fire was caused by arson? Characteristics available for prediction include the duration, location, size, land owner, and population density. This project aims to create an additional tool to help prevent wildfires.


## Data
We use data on 1.88 million US wildfires from Kaggle: https://www.kaggle.com/rtatman/188-million-us-wildfires

This dataset contains information on the location, timing, duration, and final size of each fire, along with identifying information about the fire and the source of the information. A full description of each of the variables included in the dataset can be found at the Kaggle link.
        
In addition to the wildfire dataset, we used information on the locations of urban areas from the 2010 census, available from Data.gov: https://catalog.data.gov/dataset/tiger-line-shapefile-2017-2010-nation-u-s-2010-census-urban-area-national

Fire departments often use information about human presence and influence at a fire's location to suggest its cause, and we noticed that this information was missing from the dataset. The Urban Areas data contains geographic polygons that define urban areas in two categories of population density: urbanized areas (UAs), which contain 50,000 or more people, and urban clusters (UCs), which contain at least 2,500 but fewer than 50,000 people. We used the latitude and longitude coordinates in the fire dataset to extract the population density category for each fire (Code: ExtractUrban.R).
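The extraction itself was done in R (ExtractUrban.R) and is not reproduced here. As a rough illustration of the idea only, the sketch below labels a point by the first polygon that contains it, using a plain ray-casting test. The square polygons, the helper names `point_in_polygon` and `urban_type`, and the labels `U`, `C`, and `R` (used here for urbanized area, urban cluster, and everything else) are assumptions made for this example, not the project's actual code or data.

```python
# Toy point-in-polygon lookup (ray casting); polygons and labels are hypothetical.
def point_in_polygon(lon, lat, polygon):
    """Return True if (lon, lat) lies inside the polygon given as [(lon, lat), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def urban_type(lon, lat, urban_areas, default="R"):
    """Return the label of the first polygon containing the point, else `default`."""
    for label, polygon in urban_areas:
        if point_in_polygon(lon, lat, polygon):
            return label
    return default


# Hypothetical square "urban areas": one urbanized area (U) and one urban cluster (C)
toy_areas = [
    ("U", [(-122.0, 38.0), (-121.0, 38.0), (-121.0, 39.0), (-122.0, 39.0)]),
    ("C", [(-120.5, 37.0), (-120.0, 37.0), (-120.0, 37.5), (-120.5, 37.5)]),
]

toy_points = [(-121.5, 38.5), (-120.2, 37.2), (-100.0, 45.0)]
labels = [urban_type(lon, lat, toy_areas) for lon, lat in toy_points]
print(labels)
```

Real census polygons are far more complex than squares, so a spatial library with an index would be the practical choice for 1.88 million points; the loop above only shows the logic.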

In [1]:
# load packages
import numpy as np
import pandas as pd
import matplotlib as mpl
import plotnine as p9
import itertools
import sklearn as skl
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.mixture
import matplotlib.pyplot as plt

## Load Data

In [2]:
# load data
og_fires = pd.read_csv('wildfires.csv', engine='python')

In [54]:
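# Subset data to speed computation
# Note: np.random.randint draws indices with replacement, so a fire can appear more than once in the subset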
percent_of_data = .10
np.random.seed(5)
indices = np.random.randint(0,len(og_fires), int(len(og_fires)*percent_of_data))
fires = og_fires.iloc[indices,:]
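Because `np.random.randint` can return the same index more than once, the subset above is drawn with replacement. The following is a minimal sketch, on a hypothetical stand-in frame called `toy`, of how that compares with `DataFrame.sample`, which draws without replacement by default. It is an alternative to consider, not the sampling used for the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for og_fires
toy = pd.DataFrame({"OBJECTID": np.arange(1000)})

# Indices drawn with np.random.randint can repeat (sampling with replacement)
np.random.seed(5)
idx = np.random.randint(0, len(toy), int(len(toy) * 0.10))
with_replacement = toy.iloc[idx, :]

# DataFrame.sample draws without replacement by default
without_replacement = toy.sample(frac=0.10, random_state=5)

# Count repeated rows in each subset
print(with_replacement["OBJECTID"].duplicated().sum())
print(without_replacement["OBJECTID"].duplicated().sum())
```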

In [73]:
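# Drop observations with "Missing/Undefined" cause
# Keep the dropped rows in case they are useful later
missing = fires[fires.STAT_CAUSE_DESCR == "Missing/Undefined"]
# .copy() makes fires an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
fires = fires[fires.STAT_CAUSE_DESCR != "Missing/Undefined"].copy()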

In [74]:
# Look at proportion of missing data
fires.isnull().sum()/fires.shape[0]

V1                            0.000000
OBJECTID                      0.000000
FOD_ID                        0.000000
FPA_ID                        0.000000
SOURCE_SYSTEM_TYPE            0.000000
SOURCE_SYSTEM                 0.000000
NWCG_REPORTING_AGENCY         0.000000
NWCG_REPORTING_UNIT_ID        0.000000
NWCG_REPORTING_UNIT_NAME      0.000000
SOURCE_REPORTING_UNIT         0.000000
SOURCE_REPORTING_UNIT_NAME    0.000000
LOCAL_FIRE_REPORT_ID          0.756411
LOCAL_INCIDENT_ID             0.410870
FIRE_CODE                     0.813941
FIRE_NAME                     0.489204
ICS_209_INCIDENT_NUMBER       0.986814
ICS_209_NAME                  0.986814
MTBS_ID                       0.993906
MTBS_FIRE_NAME                0.993906
COMPLEX_NAME                  0.996807
FIRE_YEAR                     0.000000
DISCOVERY_DATE                0.000000
DISCOVERY_DOY                 0.000000
DISCOVERY_TIME                0.448665
STAT_CAUSE_CODE               0.000000
STAT_CAUSE_DESCR         

In [75]:
fires.describe(include="all")

Unnamed: 0,V1,OBJECTID,FOD_ID,FPA_ID,SOURCE_SYSTEM_TYPE,SOURCE_SYSTEM,NWCG_REPORTING_AGENCY,NWCG_REPORTING_UNIT_ID,NWCG_REPORTING_UNIT_NAME,SOURCE_REPORTING_UNIT,...,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE,COUNTY,urbantype,duration
count,171317.0,171317.0,171317.0,171317,171317,171317,171317,171317,171317,171317,...,171317.0,171317,171317.0,171317.0,171317.0,171317,171317,107631,171317,91829.0
unique,,,,163084,3,32,8,1046,1045,3161,...,,7,,,,16,52,2631,3,
top,,,,W-382225,NONFED,ST-NASF,ST/C&L,USGAGAS,Georgia Forestry Commission,GAGAS,...,,B,,,,MISSING/NOT SPECIFIED,CA,SUFFOLK,R,
freq,,,,4,123587,64201,123882,16667,16667,9754,...,,85230,,,,90863,17725,752,151779,
mean,928575.5,928575.5,52019100.0,,,,,,,,...,74.994311,,37.105066,-95.777205,10.355248,,,,,1.189886
std,543304.9,543304.9,99188350.0,,,,,,,,...,2576.056951,,5.775375,15.906333,4.48315,,,,,8.442972
min,10.0,10.0,10.0,,,,,,,,...,9e-05,,17.9619,-165.8527,0.0,,,,,0.0
25%,437256.0,437256.0,471376.0,,,,,,,,...,0.1,,32.8867,-110.0015,8.0,,,,,0.0
50%,952162.0,952162.0,1080254.0,,,,,,,,...,1.0,,35.60675,-92.323333,14.0,,,,,0.0
75%,1392842.0,1392842.0,19088700.0,,,,,,,,...,3.0,,41.0371,-82.5592,14.0,,,,,0.0


In [76]:
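# make new variable of fire duration (in days) from the day-of-year columns
fires['duration'] = fires.CONT_DOY - fires.DISCOVERY_DOY
# a negative difference means the fire was contained in the following year; add 365 days (leap years ignored)
fires.loc[fires.duration < 0, "duration"] += 365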

fires.duration.describe()


count    91829.000000
mean         1.189886
std          8.442972
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        335.000000
Name: duration, dtype: float64
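To make the year-wraparound step concrete, here is a small worked example on a hypothetical three-row frame. It assumes, as the calculation above does, that a fire is contained within a year of discovery, and it ignores leap years. A missing containment day yields a missing duration, which is why the duration count above is smaller than the number of fires.

```python
import pandas as pd

# Hypothetical rows: same-year containment, containment in the next year, unknown containment
toy = pd.DataFrame({
    "DISCOVERY_DOY": [100, 360, 200],
    "CONT_DOY": [103, 5, None],
})

toy["duration"] = toy.CONT_DOY - toy.DISCOVERY_DOY
# 5 - 360 = -355, which wraps to -355 + 365 = 10 days
toy.loc[toy.duration < 0, "duration"] += 365

print(toy.duration.tolist())
```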

In [77]:
# list the distinct cause codes and cause descriptions
set(fires.STAT_CAUSE_CODE), set(fires.STAT_CAUSE_DESCR)

({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12},
 {'Arson',
  'Campfire',
  'Children',
  'Debris Burning',
  'Equipment Use',
  'Fireworks',
  'Lightning',
  'Miscellaneous',
  'Powerline',
  'Railroad',
  'Smoking',
  'Structure'})
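The two sets above do not show which code goes with which description. One way to build that lookup, sketched here on a hypothetical toy frame, is to drop duplicate code/description pairs and convert the result to a dictionary. The pairings in the toy rows are placeholders for illustration and have not been checked against the dataset.

```python
import pandas as pd

# Hypothetical rows; the code/description pairings are placeholders
toy = pd.DataFrame({
    "STAT_CAUSE_CODE": [1, 7, 5, 7, 1],
    "STAT_CAUSE_DESCR": ["Lightning", "Arson", "Debris Burning", "Arson", "Lightning"],
})

# One row per distinct pair, then code -> description
cause_lookup = (toy[["STAT_CAUSE_CODE", "STAT_CAUSE_DESCR"]]
                .drop_duplicates()
                .set_index("STAT_CAUSE_CODE")["STAT_CAUSE_DESCR"]
                .to_dict())

print(cause_lookup)
```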

In [52]:
# trim to the predictor variables of interest
# .copy() gives an independent DataFrame so later column assignments do not raise SettingWithCopyWarning
predictors = fires[['FIRE_YEAR','DISCOVERY_DOY',
       'DISCOVERY_TIME', 'CONT_DATE',
       'CONT_DOY', 'CONT_TIME', 'FIRE_SIZE', 'LATITUDE',
       'LONGITUDE', 'OWNER_DESCR', 'urbantype', 'duration']].copy()
target = fires['STAT_CAUSE_DESCR']

In [49]:
# Identify the numeric columns before label encoding so we can scale them later
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
num_cols = predictors.select_dtypes(include=numerics).columns
num_cols

Index(['FIRE_YEAR', 'DISCOVERY_DOY', 'DISCOVERY_TIME', 'CONT_DATE', 'CONT_DOY',
       'CONT_TIME', 'FIRE_SIZE', 'LATITUDE', 'LONGITUDE', 'duration'],
      dtype='object')

In [53]:
## encode all categorical variables including the fire causes
# make labelEncoder for fire causes
cause_encode = skl.preprocessing.LabelEncoder()
cause_encode.fit(list(target.values)) 
target = cause_encode.transform(list(target.values))

# make labelEncoder owner description
owner_encode = skl.preprocessing.LabelEncoder()
owner_encode.fit(list(predictors['OWNER_DESCR'].values))
predictors['OWNER_DESCR'] = owner_encode.transform(list(predictors['OWNER_DESCR'].values))

# make labelEncoder for urbantype
urban_encode = skl.preprocessing.LabelEncoder()
urban_encode.fit(list(predictors['urbantype'].values))
predictors['urbantype'] = urban_encode.transform(list(predictors['urbantype'].values))  
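`LabelEncoder` assigns integer codes in the sorted order of the class labels, so the encoded target can always be mapped back to cause names. The sketch below, on a hypothetical array of causes, shows that round trip and one possible way to derive a binary arson indicator from the encoded values. The variable names `enc`, `codes`, and `is_arson` are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical cause labels
causes = np.array(["Lightning", "Arson", "Debris Burning", "Arson"])

enc = LabelEncoder()
codes = enc.fit_transform(causes)

# classes_ holds the labels in sorted order; a label's position is its code
print(list(enc.classes_))
print(codes)

# Look up the code for "Arson" rather than hard-coding it
arson_code = enc.transform(["Arson"])[0]
is_arson = (codes == arson_code).astype(int)
print(is_arson)

# inverse_transform recovers the original labels
recovered = enc.inverse_transform(codes)
print(recovered)
```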



In [58]:
# training/test split (hold out 1/5 of the data for testing)
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(predictors, target, test_size = 0.2)
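The split above is unseeded, so the rows held out change on every run. A minimal sketch of a reproducible variant follows, using toy arrays in place of `predictors` and `target`. The `random_state` value and the use of `stratify` (which keeps class proportions the same in both parts) are suggestions, not settings used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, with 20% of them in the positive class
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Fixed seed for reproducibility; stratify keeps the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```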

In [63]:
def scale_impute(x_data, num_cols):
    # note: the scaler and imputer are refit on whatever data is passed in
    # work on a copy so the caller's DataFrame is not modified in place
    x_data = x_data.copy()

    # scale numeric columns of x data to the range [0, 1]
    mm_scaler = skl.preprocessing.MinMaxScaler(feature_range=(0, 1))
    x_data[num_cols] = mm_scaler.fit_transform(x_data[num_cols])

    # make new columns indicating what will be imputed
    cols_with_missing = [col for col in x_data.columns
                         if x_data[col].isnull().any()]

    for col in cols_with_missing:
        x_data[col + '_was_missing'] = x_data[col].isnull()

    col_names = x_data.columns

    # impute missing values with the column mean (the SimpleImputer default)
    my_imputer = SimpleImputer()
    x_data = pd.DataFrame(my_imputer.fit_transform(x_data))
    x_data.columns = col_names

    return x_data

In [65]:
X_train_imp = scale_impute(X_train, num_cols)
X_train_imp.head()

X_test_imp = scale_impute(X_test, num_cols)
X_test_imp.head()


Unnamed: 0,FIRE_YEAR,DISCOVERY_DOY,DISCOVERY_TIME,CONT_DATE,CONT_DOY,CONT_TIME,FIRE_SIZE,LATITUDE,LONGITUDE,OWNER_DESCR,urbantype,duration,DISCOVERY_TIME_was_missing,CONT_DATE_was_missing,CONT_DOY_was_missing,CONT_TIME_was_missing,duration_was_missing
0,0.521739,0.323288,0.733362,0.51352,0.323288,0.763035,1.858548e-06,0.576631,0.689078,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.695652,0.227397,0.615337,0.526874,0.469937,0.649914,1.858381e-07,0.325915,0.885647,6.0,1.0,0.000346,1.0,1.0,1.0,1.0,1.0
2,0.826087,0.10411,0.60195,0.526874,0.469937,0.649914,3.717115e-06,0.362723,0.880899,6.0,1.0,0.000346,0.0,1.0,1.0,1.0,1.0
3,0.869565,0.252055,0.734209,0.843925,0.252055,0.739296,1.858381e-07,0.508958,0.779174,11.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.334247,0.854599,0.013919,0.334247,0.854599,1.858548e-06,0.418996,0.863136,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
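Calling `scale_impute` separately on the training and test sets refits the scaler and the imputer on each one, so the two sets are scaled by different minima and maxima and can end up with different `_was_missing` columns. One common alternative, sketched below on hypothetical toy frames, is to fit the scaler, the list of columns with missing values, and the imputer on the training data only, then apply them unchanged to the test data. The helper name `add_flags_and_scale` is an assumption for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for X_train and X_test
train = pd.DataFrame({"FIRE_SIZE": [0.0, 10.0, 5.0], "duration": [0.0, np.nan, 4.0]})
test = pd.DataFrame({"FIRE_SIZE": [20.0, 5.0], "duration": [np.nan, 2.0]})
num_cols = ["FIRE_SIZE", "duration"]


def add_flags_and_scale(x, num_cols, scaler, missing_cols):
    """Scale with an already-fitted scaler and add missingness flags."""
    x = x.copy()
    x[num_cols] = scaler.transform(x[num_cols])
    for col in missing_cols:
        x[col + "_was_missing"] = x[col].isnull().astype(float)
    return x


# Fit everything on the training data only (MinMaxScaler ignores NaN when fitting)
scaler = MinMaxScaler().fit(train[num_cols])
missing_cols = [c for c in train.columns if train[c].isnull().any()]

train_prep = add_flags_and_scale(train, num_cols, scaler, missing_cols)
test_prep = add_flags_and_scale(test, num_cols, scaler, missing_cols)

# Mean imputation, with the means learned from the training data
imputer = SimpleImputer().fit(train_prep)
train_imp = pd.DataFrame(imputer.transform(train_prep),
                         columns=train_prep.columns, index=train.index)
test_imp = pd.DataFrame(imputer.transform(test_prep),
                        columns=test_prep.columns, index=test.index)

print(test_imp)
```

With this arrangement a test value can fall outside [0, 1] when it exceeds the training range, which is expected: the test set is measured on the training set's scale.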


## Arson vs Not-Arson

### Benchmarking 