# Data Science in a Blaze - Part 1

### Self education in Data Science

After wrapping up my time at Mapillary in mid-January, I decided that instead of diving directly back into a job hunt, I would take a bit of time to hone existing skills and develop some new ones. Ever since extracting and visualizing sentiment data from New York Times comments as an undergraduate in design school, I've been attracted to data analysis as a means to gain and share a greater understanding of the world. With some time on my hands and a passion for writing software in Python, I decided to dive in head first. This is the first in a series of blog posts where I'll catalog my self-education and share some of what I build.

Thus far, my education has been self-driven on [Kaggle](http://kaggle.com/), where there's a ton of knowledge. My experience diving into the domain has been pretty humbling, since pretty much everyone around me knows so much more, but it's also been enlightening and energizing: the possibilities are boundless.

Thinking about next steps, my rough plans are to:
- complete an end-to-end machine learning project and publish my progress on this blog and on Kaggle
- complete fast.ai [Deep Learning Part 1](http://course.fast.ai/)
- build a deep learning computer
- compete in at least one Kaggle competition, hopefully joining a team to boost my learning

### Starting an E2E Project: Predicting the cause of Wildfires in the United States

Surveying the datasets available to work with on Kaggle, I was immediately attracted to a dataset describing [1.88M US Wildfires](https://www.kaggle.com/rtatman/188-million-us-wildfires) from 1992 to 2015, [originally published](https://www.fs.usda.gov/rds/archive/Product/RDS-2013-0009.4/) by the US Forest Service. Between a geospatial component, high topical relevance, and personal interest in the subject, I decided that I'd focus my first end-to-end project on predicting the cause of a fire given the information available when the fire began. Given the limited information available in the dataset, it's highly likely that integrating additional data, such as historical weather or land use, will be essential to building a strong model.

This first blog post covers the initial steps of preparing an environment and the data so that we can begin working with it. Since the project is still moving, I expect the content here to change; updates will be posted below:

- *2017/1/30* - initial post

### Some notes on Jupyter

This post, and the work behind it, was completed in a Jupyter notebook running in a Docker container. My process is roughly cataloged [here](https://www.andrewmahon.info/blog/docker-compose-data-science), although the container has been modified a bit over the course of the project.

## Prepare Environment

The first thing that we need to do is prepare our working environment.

### Load Libraries

In [2]:
import itertools
import math
import os

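from IPython.display import HTML, display
import geopandas as gpd
import graphviz
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
import palettable
import pandas as pd
import pandas.plotting as pdplot  # pandas.tools.plotting is deprecated in favor of pandas.plotting
import pprint
import seaborn as sns
import shapely
import sklearn
import sklearn.metrics  # import submodules explicitly; `import sklearn` alone does not guarantee them
import sklearn.preprocessing
from sklearn import model_selection
import sqlite3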

### Configure

In [3]:
%matplotlib inline

mpl.rcParams['figure.figsize'] = (15, 15)
mpl.rcParams['agg.path.chunksize'] = 100000

qual_colormap = palettable.matplotlib.Inferno_20
quant_colormap = palettable.matplotlib.Inferno_20_r

mpl.rcParams['image.cmap'] = qual_colormap.mpl_colormap
sns.set(rc={'figure.figsize':(15, 15)})
sns.set_palette(qual_colormap.mpl_colors)

## Load Data

Define the location of the data file, open a SQLite connection, define the query, and read the result into a DataFrame. Since we don't have a good sense of what's in the data yet, let's load every column that looks potentially relevant.

In [4]:
input_filename = '/data/188-million-us-wildfires/src/FPA_FOD_20170508.sqlite'
conn = sqlite3.connect(input_filename)

query = '''
    SELECT
        NWCG_REPORTING_AGENCY,
        NWCG_REPORTING_UNIT_ID,
        NWCG_REPORTING_UNIT_NAME,
        FIRE_NAME,
        COMPLEX_NAME,
        FIRE_YEAR,
        DISCOVERY_DATE,
        DISCOVERY_DOY,
        DISCOVERY_TIME,
        STAT_CAUSE_CODE,
        STAT_CAUSE_DESCR,
        CONT_DATE,
        CONT_DOY,
        CONT_TIME,
        FIRE_SIZE,
        FIRE_SIZE_CLASS,
        LATITUDE,
        LONGITUDE,
        OWNER_CODE,
        OWNER_DESCR,
        STATE,
        COUNTY
    FROM
        Fires;
'''

raw_df = pd.read_sql_query(query, conn)
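
Before writing a query like the one above, it can help to ask SQLite which columns a table actually has. The following is a minimal sketch, assuming nothing about the real database: it builds a tiny in-memory table (the `Fires` schema here is a made-up stand-in, and `list_columns` is a hypothetical helper) and inspects it with SQLite's built-in `PRAGMA table_info`.

```python
import sqlite3

import pandas as pd


def list_columns(conn, table):
    """Return one row per column of `table` (cid, name, type, ...)."""
    # Table names cannot be bound as query parameters,
    # so only pass trusted names to this helper.
    return pd.read_sql_query('PRAGMA table_info({});'.format(table), conn)


# Tiny in-memory stand-in for the real database (hypothetical schema)
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE Fires (FIRE_YEAR INTEGER, STATE TEXT, FIRE_SIZE REAL)'
)
columns = list_columns(demo_conn, 'Fires')
print(columns[['name', 'type']])
demo_conn.close()
```

Pointed at the real connection, the same helper would list every column of `Fires`, which makes it easier to decide what to leave out of the `SELECT`.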

## Review Raw Data

Now that our data is loaded, let's give it a very high level look and start to develop an understanding of what we're working with.

### Info
Let's have a look at our column names and the type of data in each column.

In [5]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 22 columns):
NWCG_REPORTING_AGENCY       object
NWCG_REPORTING_UNIT_ID      object
NWCG_REPORTING_UNIT_NAME    object
FIRE_NAME                   object
COMPLEX_NAME                object
FIRE_YEAR                   int64
DISCOVERY_DATE              float64
DISCOVERY_DOY               int64
DISCOVERY_TIME              object
STAT_CAUSE_CODE             float64
STAT_CAUSE_DESCR            object
CONT_DATE                   float64
CONT_DOY                    float64
CONT_TIME                   object
FIRE_SIZE                   float64
FIRE_SIZE_CLASS             object
LATITUDE                    float64
LONGITUDE                   float64
OWNER_CODE                  float64
OWNER_DESCR                 object
STATE                       object
COUNTY                      object
dtypes: float64(8), int64(2), object(12)
memory usage: 315.6+ MB


### Missing Values
Let's see how many values are missing in each column.

In [6]:
raw_df.isna().sum()

NWCG_REPORTING_AGENCY             0
NWCG_REPORTING_UNIT_ID            0
NWCG_REPORTING_UNIT_NAME          0
FIRE_NAME                    957189
COMPLEX_NAME                1875282
FIRE_YEAR                         0
DISCOVERY_DATE                    0
DISCOVERY_DOY                     0
DISCOVERY_TIME               882638
STAT_CAUSE_CODE                   0
STAT_CAUSE_DESCR                  0
CONT_DATE                    891531
CONT_DOY                     891531
CONT_TIME                    972173
FIRE_SIZE                         0
FIRE_SIZE_CLASS                   0
LATITUDE                          0
LONGITUDE                         0
OWNER_CODE                        0
OWNER_DESCR                       0
STATE                             0
COUNTY                       678148
dtype: int64
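
Raw counts are easier to judge as a share of all rows. In the output above, for example, **COUNTY** is missing in 678,148 of 1,880,465 rows, or roughly 36%. A small sketch of the same calculation on a made-up miniature frame (the values in `demo_df` are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of raw_df
demo_df = pd.DataFrame({
    'STATE': ['CA', 'TX', 'NY', 'GA'],
    'COUNTY': ['San Diego', None, None, 'Telfair'],
    'CONT_DOY': [362.0, np.nan, np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing values per column.
missing_share = demo_df.isna().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```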

### Sample
Let's look at some sample data.

In [7]:
raw_df.sample(10)

Unnamed: 0,NWCG_REPORTING_AGENCY,NWCG_REPORTING_UNIT_ID,NWCG_REPORTING_UNIT_NAME,FIRE_NAME,COMPLEX_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,...,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE,COUNTY
1526538,ST/C&L,USTXHAS,Texas Forest Service - Henderson Area,HENDERSON - 423,,2011,2455797.5,236,1406.0,1.0,...,236.0,1800.0,1.0,B,31.726667,-94.355,13.0,STATE OR PRIVATE,TX,Shelby
316255,BLM,USCACDD,California Desert District,BORDER 58,,2005,2453732.5,362,2255.0,2.0,...,362.0,2312.0,0.1,A,32.577661,-116.62775,8.0,PRIVATE,CA,San Diego
1779172,ST/C&L,USNYNYX,Fire Department of New York,,,2014,2456742.5,85,1909.0,9.0,...,180.0,1941.0,0.1,A,41.3867,-73.8735,14.0,MISSING/NOT SPECIFIED,NY,PUTNAM
795654,ST/C&L,USWIWIS,Wisconsin Department of Natural Resources,,,1992,2448729.5,108,2140.0,2.0,...,108.0,2215.0,0.9,B,43.998542,-90.615036,14.0,MISSING/NOT SPECIFIED,WI,Monroe
1268659,ST/C&L,USKYKYS,Kentucky Division of Forestry,,,2008,2454504.5,39,1545.0,5.0,...,39.0,1545.0,30.0,C,37.56109,-82.23566,14.0,MISSING/NOT SPECIFIED,KY,Pike
793850,ST/C&L,USWIWIS,Wisconsin Department of Natural Resources,,,1995,2449836.5,119,1331.0,4.0,...,119.0,1444.0,2.75,B,44.744363,-89.7647,14.0,MISSING/NOT SPECIFIED,WI,Marathon
1397854,ST/C&L,USGAGAS,Georgia Forestry Commission,,,1996,2450124.5,42,2240.0,7.0,...,42.0,2310.0,0.8,B,32.0126,-83.0427,8.0,PRIVATE,GA,Telfair
493028,ST/C&L,USMNMNS,Minnesota Department of Natural Resources,,,2007,2454221.5,121,,9.0,...,,,0.25,A,45.854778,-93.919926,14.0,MISSING/NOT SPECIFIED,MN,Morrison
1113349,ST/C&L,USCANEU,Nevada-Yuba-Placer Unit,FAR WEST 2,,2001,2452126.5,217,,2.0,...,,,0.1,A,39.051111,-121.313056,14.0,MISSING/NOT SPECIFIED,CA,
1725060,FS,USIDBOF,Boise National Forest,LUCKY,,2014,2456846.5,189,6.0,3.0,...,189.0,1154.0,2.0,B,43.591389,-115.999722,13.0,STATE OR PRIVATE,ID,Ada


### Describe

In [8]:
raw_df.describe(include='all')

Unnamed: 0,NWCG_REPORTING_AGENCY,NWCG_REPORTING_UNIT_ID,NWCG_REPORTING_UNIT_NAME,FIRE_NAME,COMPLEX_NAME,FIRE_YEAR,DISCOVERY_DATE,DISCOVERY_DOY,DISCOVERY_TIME,STAT_CAUSE_CODE,...,CONT_DOY,CONT_TIME,FIRE_SIZE,FIRE_SIZE_CLASS,LATITUDE,LONGITUDE,OWNER_CODE,OWNER_DESCR,STATE,COUNTY
count,1880465,1880465,1880465,923276,5183,1880465.0,1880465.0,1880465.0,997827.0,1880465.0,...,988934.0,908292.0,1880465.0,1880465,1880465.0,1880465.0,1880465.0,1880465,1880465,1202317.0
unique,11,1640,1635,493633,1416,,,,1440.0,,...,,1441.0,,7,,,,16,52,3455.0
top,ST/C&L,USGAGAS,Georgia Forestry Commission,GRASS FIRE,OSAGE-MIAMI COMPLEX,,,,1400.0,,...,,1800.0,,B,,,,MISSING/NOT SPECIFIED,CA,5.0
freq,1377090,167123,167123,3983,54,,,,20981.0,,...,,38078.0,,939376,,,,1050835,189550,7576.0
mean,,,,,,2003.71,2453064.0,164.7191,,5.979037,...,172.656766,,74.52016,,36.78121,-95.70494,10.59658,,,
std,,,,,,6.663099,2434.573,90.03891,,3.48386,...,84.320348,,2497.598,,6.139031,16.71694,4.404662,,,
min,,,,,,1992.0,2448622.0,1.0,,1.0,...,1.0,,1e-05,,17.93972,-178.8026,0.0,,,
25%,,,,,,1998.0,2451084.0,89.0,,3.0,...,102.0,,0.1,,32.8186,-110.3635,8.0,,,
50%,,,,,,2004.0,2453178.0,164.0,,5.0,...,181.0,,1.0,,35.4525,-92.04304,14.0,,,
75%,,,,,,2009.0,2455036.0,230.0,,9.0,...,232.0,,3.3,,40.8272,-82.2976,14.0,,,


Some initial observations about the data:
- **STAT_CAUSE_CODE** and **STAT_CAUSE_DESCR** are related and represent the value that we are trying to predict. Before training, we'll drop **STAT_CAUSE_DESCR** in favor of the numerical value of **STAT_CAUSE_CODE**.
- **OWNER_CODE** and **OWNER_DESCR** are related and describe the owner of the property where the fire was discovered. This is an interesting value because it represents the land management and usage of a particular piece of land. Before training, we'll drop **OWNER_DESCR** in favor of the numerical value of **OWNER_CODE**.
- **DISCOVERY_DATE**, **DISCOVERY_DOY**, and **DISCOVERY_TIME** describe when a fire was discovered. **DISCOVERY_DOY** is the most interesting to our investigation due to its relation to climate and usage patterns of a particular piece of land. **DISCOVERY_TIME** may be interesting, but it might also be too fine-grained, and it is missing in nearly half of the rows (882,638 of 1,880,465), so let's drop it for now. **DISCOVERY_DATE** is [TKTK]
- **LATITUDE** and **LONGITUDE** are both very interesting due to their strong relationship to land cover, land use, and climate, all three of which are big factors in wildfire creation.
- **STATE** and **COUNTY** both categorically describe the location of a fire. **STATE** might be interesting due to its relation to land use patterns. **STATE** also might prove to be a useful generalization of the more specific **LATITUDE** and **LONGITUDE**.
- **COUNTY**, while potentially interesting, has too many missing values (678,148 rows). If we want to explore categorical location data more closely, we can add it via geocoding during the data engineering process.
- A number of columns contain information about how a fire was addressed and not about what caused the fire. Let's ignore the following columns for now: **NWCG_REPORTING_AGENCY**, **NWCG_REPORTING_UNIT_ID**, **NWCG_REPORTING_UNIT_NAME**, **FIRE_NAME**, **COMPLEX_NAME**, **CONT_DATE**, **CONT_DOY**, **CONT_TIME**, **FIRE_SIZE**, **FIRE_SIZE_CLASS**

This leaves us with the following interesting fields:
- **STAT_CAUSE_CODE**
- **STAT_CAUSE_DESCR** [for EDA]
- **OWNER_CODE**
- **OWNER_DESCR**
- **DISCOVERY_DOY**
- **LATITUDE**
- **LONGITUDE**
- **STATE**

Some things we can keep in our back pocket for future exploration:
- look harder at **DISCOVERY_TIME**
- look harder at **DISCOVERY_DATE**
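
As a starting point for looking harder at **DISCOVERY_DATE**: the values in the sample output (for example `2455797.5`) look like Julian dates, that is, days counted from noon on January 1, 4713 BC. That reading is my assumption rather than something stated in this post, but it is easy to check against **FIRE_YEAR** and **DISCOVERY_DOY**. The sketch below converts two values from the sample using the Julian date of the Unix epoch; `julian_to_datetime` is a hypothetical helper name.

```python
import pandas as pd

# Julian date of the Unix epoch (1970-01-01 00:00)
UNIX_EPOCH_JULIAN = 2440587.5


def julian_to_datetime(julian_days):
    """Convert Julian dates (days, as floats) to pandas timestamps."""
    return pd.to_datetime(julian_days - UNIX_EPOCH_JULIAN, unit='D')


# Two DISCOVERY_DATE values taken from the sample output above
discovery_date = pd.Series([2455797.5, 2453732.5])
discovered = julian_to_datetime(discovery_date)
print(discovered.dt.year.tolist(), discovered.dt.dayofyear.tolist())
```

Both converted rows agree with the **FIRE_YEAR** and **DISCOVERY_DOY** values shown for them in the sample (2011 / 236 and 2005 / 362), which supports the Julian date reading.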

### Create Human Readable Mappings
Before we drop our human-readable columns, let's create a set of mappings that we can use to associate numerical categories back to human-readable categories.

In [9]:
stat_cause_mapping = raw_df \
    .groupby(['STAT_CAUSE_DESCR', 'STAT_CAUSE_CODE']) \
    .size()\
    .to_frame()\
    .reset_index()\
    .drop(0, axis=1)\
    .set_index('STAT_CAUSE_CODE')\
    .sort_index()['STAT_CAUSE_DESCR']
stat_cause_mapping

STAT_CAUSE_CODE
1.0             Lightning
2.0         Equipment Use
3.0               Smoking
4.0              Campfire
5.0        Debris Burning
6.0              Railroad
7.0                 Arson
8.0              Children
9.0         Miscellaneous
10.0            Fireworks
11.0            Powerline
12.0            Structure
13.0    Missing/Undefined
Name: STAT_CAUSE_DESCR, dtype: object

In [10]:
owner_code_mapping = raw_df \
    .groupby(['OWNER_DESCR', 'OWNER_CODE']) \
    .size()\
    .to_frame()\
    .reset_index()\
    .drop(0, axis=1)\
    .set_index('OWNER_CODE')\
    .sort_index()['OWNER_DESCR']
owner_code_mapping

OWNER_CODE
0.0                   FOREIGN
1.0                       BLM
2.0                       BIA
3.0                       NPS
4.0                       FWS
5.0                      USFS
6.0             OTHER FEDERAL
7.0                     STATE
8.0                   PRIVATE
9.0                    TRIBAL
10.0                      BOR
11.0                   COUNTY
12.0          MUNICIPAL/LOCAL
13.0         STATE OR PRIVATE
14.0    MISSING/NOT SPECIFIED
15.0        UNDEFINED FEDERAL
Name: OWNER_DESCR, dtype: object

## Strip Data

Let's create a new dataframe that contains only the fields that we're interested in. This will reduce memory usage and help keep things tidy. 

In [11]:
# Select the columns first, then copy, so only the reduced frame is duplicated
df = raw_df[[
    'STAT_CAUSE_CODE',
    'STAT_CAUSE_DESCR',
    'OWNER_CODE',
    'OWNER_DESCR',
    'DISCOVERY_DOY',
    'LATITUDE',
    'LONGITUDE',
    'STATE'
]].copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 8 columns):
STAT_CAUSE_CODE     float64
STAT_CAUSE_DESCR    object
OWNER_CODE          float64
OWNER_DESCR         object
DISCOVERY_DOY       int64
LATITUDE            float64
LONGITUDE           float64
STATE               object
dtypes: float64(4), int64(1), object(3)
memory usage: 114.8+ MB


## Checkpoint [0]

Getting `df` to the state it is in right now took some time. Let's checkpoint it to disk so we can come back to it later.

In [12]:
# df_checkpoint_path = '../data/188-million-us-wildfires/wildfires-cause-prediction-df_checkpoint_0.pickle'

# OVERWRITE = False

# if not os.path.exists(df_checkpoint_path) or OVERWRITE:
#     df.to_pickle(df_checkpoint_path)
# else :
#     df = pd.read_pickle(df_checkpoint_path)

## Split Data

We need to split the data into a train set and a test set. The train set will be used to build our model, and the test set will be used to evaluate the model.

We will use sklearn's [`model_selection.train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split our dataframe into two.

Last, we will create a convenience list, `_df`, that lets us loop over both the train and test sets.

In [13]:
train_df, test_df = model_selection.train_test_split(df)

display(HTML('''
<p>
    Number of Training Rows: {}<br />
    Number of Test Rows: {}
</p>
'''.format(train_df.shape[0], test_df.shape[0])))

_df = [train_df, test_df]
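
Two optional arguments of `train_test_split` may be worth considering for a dataset like this one: `random_state` makes the split repeatable between notebook runs, and `stratify` keeps the proportion of each cause roughly equal in both sets. A minimal sketch on a made-up frame (`demo_df` and its values are illustrative, not the wildfire data):

```python
import pandas as pd
from sklearn import model_selection

# Hypothetical stand-in for df: 100 rows, four equally common causes
demo_df = pd.DataFrame({
    'STAT_CAUSE_CODE': [1.0, 2.0, 5.0, 7.0] * 25,
    'DISCOVERY_DOY': range(1, 101),
})

demo_train, demo_test = model_selection.train_test_split(
    demo_df,
    test_size=0.25,                       # same as the default 75/25 split
    random_state=42,                      # fixed seed: same split every run
    stratify=demo_df['STAT_CAUSE_CODE'],  # keep cause proportions similar
)
print(demo_train.shape[0], demo_test.shape[0])
```

Whether stratifying is appropriate depends on how the model will be evaluated; it is shown here only as an option.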

---

# END BLOG 1

---

# Blog Post 2

## Clean Data

### Quantify Data

A couple of fields in our reduced set are still non-numeric. In particular, the **STATE** field is still text, so let's convert it to a numeric value.

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

In [13]:
# Fit once on the full STATE column so train and test share the same codes
label = sklearn.preprocessing.LabelEncoder()
label.fit(df['STATE'])
for dataframe in _df:
    dataframe['STATE_CODE'] = label.transform(dataframe['STATE'])
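
A small illustration of why it matters where a `LabelEncoder` is fitted. The encoder assigns codes by sorting the distinct values it sees, so two frames that contain different sets of states can end up with different codes for the same state. The frames below are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo_train = pd.DataFrame({'STATE': ['CA', 'TX', 'GA']})
demo_test = pd.DataFrame({'STATE': ['TX', 'NY']})

# Fitting separately: 'TX' receives a different code in each frame
separate_train = LabelEncoder().fit_transform(demo_train['STATE'])
separate_test = LabelEncoder().fit_transform(demo_test['STATE'])

# Fitting once on every state: the codes are shared
encoder = LabelEncoder().fit(pd.concat([demo_train['STATE'], demo_test['STATE']]))
shared_train = encoder.transform(demo_train['STATE'])
shared_test = encoder.transform(demo_test['STATE'])

print(separate_train, separate_test)
print(shared_train, shared_test)
```

With a shared encoder, `encoder.inverse_transform` can also turn codes back into state abbreviations later on.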

### Bin Data

Ref: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html  https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
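
To make the difference between the two referenced functions concrete, here is a minimal sketch on made-up values. `pd.cut` uses bin edges chosen by hand, while `pd.qcut` picks edges from the data so that each bin holds about the same number of rows. The edges and labels below are illustrative choices, not decisions made for this project.

```python
import pandas as pd

doy = pd.Series([15, 100, 200, 300, 366], name='DISCOVERY_DOY')
fire_size = pd.Series([0.1, 0.1, 1.0, 3.3, 30.0, 500.0, 0.25, 2.0], name='FIRE_SIZE')

# pd.cut: fixed edges (here, rough quarters of the year); intervals are right-inclusive
doy_quarter = pd.cut(doy, bins=[0, 91, 182, 274, 366], labels=['Q1', 'Q2', 'Q3', 'Q4'])

# pd.qcut: quantile-based edges; labels=False returns integer bin numbers
size_bin = pd.qcut(fire_size, q=4, labels=False)

print(doy_quarter.tolist())
print(size_bin.value_counts().sort_index())
```

Quantile bins tend to suit heavily skewed columns such as **FIRE_SIZE**, where fixed-width bins would put almost every row in the first bin.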

### Reproject Data

The latitude and longitude values in our input dataset are geographic coordinates referenced to the [NAD83 datum](https://en.wikipedia.org/wiki/North_American_Datum).

Ref: http://jswhit.github.io/pyproj/

In [None]:
import pyproj

## Explore Data [0] - Exploratory Data Analysis

1. Causes
2. Week of Year [todo]
3. Week of Year and Cause
4. Owner
5. Owner and Cause
6. State
7. State and Cause

### Causes

Let's explore the causes of wildfires represented in our dataset.

In [None]:
# Explore the training set only, so the test set stays unseen
counts_by_cause = train_df.groupby('STAT_CAUSE_DESCR')\
    .size()\
    .sort_values(ascending=False)
counts_by_cause_pcts = 100 * counts_by_cause / counts_by_cause.sum()

ax = sns.barplot(x=counts_by_cause.index, y=counts_by_cause.values)
ax.set_xticklabels(labels=counts_by_cause.index, rotation=90)

# Annotate each bar with its share of all fires
for i, p in enumerate(ax.patches):
    height = p.get_height()
    width = p.get_width()
    ax.text(
        p.get_x() + (width / 2.),
        height + 1000,
        '{:1.2f}%'.format(counts_by_cause_pcts.iloc[i]),
        ha="center")

plt.show()

### Day of Year

In [None]:
count_by_doy = train_df.groupby('DISCOVERY_DOY').size()
ax = count_by_doy.plot()
ax.set_xlim(0,367)
ax.set_ylim(0,10000)

### Day of Year and Cause

Hypothesis: [TKTK]

Process: Add a 'DISCOVERY_WEEK' column to the table (done in the Engineer Data section). Note that we end up with 53 weeks because a year is slightly longer than 52 seven-day weeks. I also added one so that the weeks are one-indexed, which better matches common usage.

Process: Create a pivot table that relates `STAT_CAUSE_DESCR` to `DISCOVERY_DOY`. Plot it as a seaborn heatmap.

In [None]:
cause_by_doy = train_df.groupby(['DISCOVERY_DOY', 'STAT_CAUSE_DESCR'])\
    .size()\
    .unstack()
causes = list(cause_by_doy.columns.values)
# Share of each cause among all fires discovered on the same day of year
cause_by_doy_proportional = cause_by_doy.div(cause_by_doy.sum(axis=1), axis=0)
cause_by_doy.head(10)

In [None]:
from cycler import cycler
ax = cause_by_doy.plot.area()
ax.set_xlim(0,367)
ax.set_ylim(0,10000)

In [None]:
ax = sns.heatmap(
    cause_by_doy,
    cbar_kws={'shrink':.9 },
    annot=False,
    cmap=quant_colormap.mpl_colormap
)
for i, label in enumerate(ax.yaxis.get_ticklabels()):
    label.set_visible(False)
    if i % 7 == 0:
        label.set_visible(True)

In [None]:
ax = sns.heatmap(
    cause_by_doy_proportional,
    cbar_kws={'shrink':.9 }, 
    annot=False,
    cmap=quant_colormap.mpl_colormap
)
for i, label in enumerate(ax.yaxis.get_ticklabels()):
    label.set_visible(False)
    if i % 7 == 0:
        label.set_visible(True)

Analysis: [TKTK]

### Owner

In [None]:
counts_by_owner = train_df.groupby('OWNER_DESCR')\
    .size()\
    .sort_values(ascending=False)

ax = sns.barplot(x=counts_by_owner.index, y=counts_by_owner.values)
labels = ax.set_xticklabels(labels=counts_by_owner.index, rotation=90)

### Cause and Owner

Create a pivot table that relates `OWNER_DESCR` to `STAT_CAUSE_DESCR` and plot it as a heatmap.

In [None]:
cause_by_owner = train_df.groupby(['OWNER_DESCR', 'STAT_CAUSE_DESCR'])\
    .size()\
    .unstack()

ax = sns.heatmap(
    cause_by_owner,
    cbar_kws={'shrink':.9 }, 
    annot=False,
    cmap='inferno_r'
)

### State

In [None]:
counts_by_state = train_df.groupby('STATE')\
    .size()\
    .sort_values(ascending=False)

ax = sns.barplot(x=counts_by_state.index, y=counts_by_state.values)
labels = ax.set_xticklabels(labels=counts_by_state.index, rotation=90)

### State, Geographic

Load our State outlines, join in some abbreviations.

Outlines sourced from: http://eric.clst.org/tech/usgeojson/

#### Create States Dataframe containing Border Geometries

In [None]:
state_outlines_path = '/data/188-million-us-wildfires/src/gz_2010_us_040_00_500k.json'
state_outlines_df = gpd.read_file(state_outlines_path)

state_codes_path = '/data/188-million-us-wildfires/src/state_codes.json'
state_codes_df = pd.read_json(state_codes_path, orient='records').set_index('name')
states = state_outlines_df\
    .join(state_codes_df, on='NAME').set_index('alpha-2')\
    .join(counts_by_state.to_frame().rename(columns={0:'count'}))
states.sample(5)

In [None]:
states.to_crs(epsg=3395).plot(column='count', cmap='inferno')

### State and Cause

In [None]:
cause_by_state = train_df.groupby(['STATE', 'STAT_CAUSE_DESCR'])\
    .size()\
    .unstack()
causes = list(cause_by_state.columns.values)
# Share of each cause among all fires in the same state
cause_by_state_proportional = cause_by_state.div(cause_by_state.sum(axis=1), axis=0)

ax = sns.heatmap(
    cause_by_state,
    cbar_kws={'shrink':.9 }, 
    annot=False,
    cmap='inferno_r'
)

### Pearson Correlation

Ref: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

In [None]:
#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    #colormap = sns.diverging_palette(220, 10, as_cmap = True)
    colormap = quant_colormap.mpl_colormap
    
    _ = sns.heatmap(
        df.select_dtypes(include='number').corr(),  # Pearson needs numeric columns
        square=True,
        cmap = colormap,
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,
        vmax=1.0,
        linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(train_df)
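
For reference, the coefficient in each cell of that heatmap is the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand with NumPy; the input values are made up for the example and `pearson_r` is a hypothetical helper name.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


latitude = [31.7, 32.6, 41.4, 44.0, 37.6]
discovery_doy = [236, 362, 85, 108, 39]
r = pearson_r(latitude, discovery_doy)
print(round(r, 3))
```

One caveat when reading the heatmap: Pearson measures linear association between ordered numbers, so coefficients involving arbitrary category codes such as **OWNER_CODE** or **STATE_CODE** are hard to interpret.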

---
# END BLOG 2

---

## Engineer Data

### Add Missing Data

In [None]:
# ENGINEER LOCATION OBJECT

# geometry = [shapely.geometry.Point(xy) for xy in zip(df.LONGITUDE, df.LATITUDE)]
# df.drop(['LONGITUDE', 'LATITUDE'], axis=1, inplace=True)
# crs = {'init': 'epsg:4269'}
# gdf = gpd.GeoDataFrame(df, crs=crs, geometry=geometry)
# del df
# gdf = gdf.to_crs({'init': 'epsg:4326'})

# print(gdf.info())
# gdf.sample(5)

### Discovery Week of Year

In [None]:
for dataframe in _df:
    # Days 1-7 fall in week 1, days 8-14 in week 2, and so on up to week 53
    dataframe['DISCOVERY_WEEK'] = (dataframe['DISCOVERY_DOY'] - 1) // 7 + 1
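
A quick sanity check of a day-to-week mapping, assuming weeks are defined as consecutive seven-day blocks starting on day 1. It confirms the range of week numbers and shows that the final week is a short one.

```python
import pandas as pd

doy = pd.Series(range(1, 367), name='DISCOVERY_DOY')  # 1..366 covers leap years
week = (doy - 1) // 7 + 1  # days 1-7 -> week 1, days 8-14 -> week 2, ...

weeks_seen = week.nunique()
days_per_week = week.value_counts().sort_index()
print(week.min(), week.max(), weeks_seen)
print(days_per_week.tail(2))
```

Because week 53 only ever contains one or two days, its counts will look unusually low in any per-week plot.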

### Climate Regions



Ref: https://www.ncdc.noaa.gov/monitoring-references/maps/us-climate-regions.php

In [None]:
#CLIMATE REGION
climate_regions = {
    "northwest": [
        "WA",
        "OR",
        "ID"
    ],
    "west": [
        "CA",
        "NV",
    ],
    "southwest": [
        "UT",
        "CO",
        "AZ",
        "NM",
    ],
    "northern_rockies": [
        "MT",
        "ND",
        "SD",
        "WY",
        "NE",
    ],
    "south": [
        "KS",
        "OK",
        "TX",
        "AR",
        "LA",
        "MS",
    ],
    "upper_midwest": [
        "MN",
        "WI",
        "MI",
        "IA"
    ],
    "ohio_valley": [
        "MO",
        "IL",
        "IN",
        "OH",
        "WV",
        "KY",
        "TN",
    ],
    "southeast": [
        "VA",
        "NC",
        "SC",
        "GA",
        "AL",
        "FL",
    ],
    "northeast": [
        "ME",
        "NH",
        "VT",
        "NY",
        "PA",
        "MA",
        "RI",
        "CT",
        "NJ",
        "DE",
        "MD",
        "DC"
    ],
    "alaska": [
        "AK",
    ],
    "hawaii": [
        "HI"
    ],
    "puerto_rico": [
        "PR"
    ]
}

state_region_mapping = {}
for region, region_states in climate_regions.items():
    for state in region_states:
        state_region_mapping[state] = region
        
for dataframe in _df:
    dataframe['CLIMATE_REGION'] = dataframe['STATE'].map(state_region_mapping)

# Fit once on every region name so train and test share the same codes
label = sklearn.preprocessing.LabelEncoder()
label.fit(list(climate_regions.keys()))
for dataframe in _df:
    dataframe['CLIMATE_REGION_CODE'] = label.transform(dataframe['CLIMATE_REGION'])

train_df.sample(10)

### H3 Binning

Ref:

https://github.com/uber/h3

https://github.com/uber/h3/blob/master/docs/doxyfiles/restable.md

In [None]:
import subprocess
from tqdm import tqdm

def h3_for_chunk(chunk, precision):
    lat_lon_lines = "\n".join(["{} {}".format(row['LATITUDE'], row['LONGITUDE']) for index,row in chunk.iterrows()])
    h3 = subprocess.run(
        ['/tools/h3/bin/geoToH3', str(precision)],
        input=str.encode(lat_lon_lines),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    return pd.DataFrame({
        "h3_{}".format(precision): h3
    }, index=chunk.index)

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Compute H3 cells for the train set and the test set in turn
for i, dataframe in enumerate(_df):
    h3_chunks = []
    # The progress bar counts rows, so use shape[0] throughout
    with tqdm(total=dataframe.shape[0]) as pbar:
        for chunk in chunker(dataframe, 10000):
            h3_chunks.append(h3_for_chunk(chunk, 5))
            pbar.update(chunk.shape[0])
    # Concatenate once at the end instead of growing a frame inside the loop
    _df[i] = dataframe.join(pd.concat(h3_chunks))

In [14]:
for dataframe in _df:
    display(dataframe.sample(10))

Unnamed: 0,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,OWNER_CODE,OWNER_DESCR,DISCOVERY_DOY,LATITUDE,LONGITUDE,STATE
1250645,8.0,Children,8.0,PRIVATE,58,31.8239,-82.4981,GA
157543,1.0,Lightning,5.0,USFS,207,37.345,-106.9725,CO
1844989,9.0,Miscellaneous,7.0,STATE,164,45.7299,-122.3629,WA
216000,9.0,Miscellaneous,1.0,BLM,254,34.65,-117.2342,CA
1673494,9.0,Miscellaneous,14.0,MISSING/NOT SPECIFIED,63,32.4288,-85.754428,AL
281668,7.0,Arson,2.0,BIA,265,41.0832,-123.6845,CA
1853418,13.0,Missing/Undefined,14.0,MISSING/NOT SPECIFIED,76,32.194,-110.8348,AZ
1822914,5.0,Debris Burning,8.0,PRIVATE,90,34.295269,-80.931343,SC
501464,7.0,Arson,14.0,MISSING/NOT SPECIFIED,115,45.792378,-93.433368,MN
339463,1.0,Lightning,8.0,PRIVATE,209,32.65293,-104.92554,NM


Unnamed: 0,STAT_CAUSE_CODE,STAT_CAUSE_DESCR,OWNER_CODE,OWNER_DESCR,DISCOVERY_DOY,LATITUDE,LONGITUDE,STATE
1305567,2.0,Equipment Use,14.0,MISSING/NOT SPECIFIED,323,43.059367,-108.361091,WY
1320505,3.0,Smoking,8.0,PRIVATE,90,39.012455,-79.721108,WV
1305735,4.0,Campfire,14.0,MISSING/NOT SPECIFIED,118,42.766171,-108.797927,WY
1276287,2.0,Equipment Use,8.0,PRIVATE,251,45.07925,-90.51273,WI
796473,7.0,Arson,14.0,MISSING/NOT SPECIFIED,121,43.596278,-89.869891,WI
63249,1.0,Lightning,5.0,USFS,229,47.583333,-115.035,MT
715850,8.0,Children,12.0,MUNICIPAL/LOCAL,258,46.17575,-123.82908,OR
1017609,2.0,Equipment Use,14.0,MISSING/NOT SPECIFIED,117,35.6583,-81.4217,NC
7476,7.0,Arson,14.0,MISSING/NOT SPECIFIED,68,30.644444,-89.135,MS
983899,9.0,Miscellaneous,14.0,MISSING/NOT SPECIFIED,146,29.24,-82.45,FL


In [None]:
for dataframe in _df:
    display(dataframe.h3_5.unique())

### Geohash

In [None]:
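import geohash

def geohashes_for_geometry(geometry, precision=(7, 6, 4, 3)):
    # Encode once at the finest requested precision, then truncate for coarser levels
    _lon = geometry.coords.xy[0][0]
    _lat = geometry.coords.xy[1][0]
    _geohash = geohash.encode(_lat, _lon, precision=max(precision))
    _output = [_geohash[0:p] for p in precision]
    return pd.Series(_output)


# NOTE: this assumes each dataframe has a 'geometry' column of shapely Points,
# as built in the (currently commented out) location cell above
for i, dataframe in enumerate(_df):
    geohashes = dataframe['geometry'].apply(geohashes_for_geometry, precision=(4,))
    _df[i] = dataframe.join(geohashes.rename(columns={0: 'geohash_4'}))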

## Explore Data [1]

Now that we have engineered some new data, let's revisit some of our exploratory techniques.

### Climate Region

In [None]:
cause_by_region = train_df.groupby(['CLIMATE_REGION', 'STAT_CAUSE_DESCR'])\
    .size()\
    .unstack()

ax = sns.heatmap(
    cause_by_region,
    cbar_kws={'shrink':.9 }, 
    annot=False,
    cmap='inferno_r'
)

### Pearson Correlation

Ref: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

In [None]:
#correlation heatmap of dataset
def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    #colormap = sns.diverging_palette(220, 10, as_cmap = True)
    colormap = quant_colormap.mpl_colormap
    
    _ = sns.heatmap(
        df.select_dtypes(include='number').corr(),  # Pearson needs numeric columns
        square=True,
        cmap = colormap,
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,
        vmax=1.0,
        linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(train_df)

## Convert Data

Drop text category fields that duplicate numerical category fields.

In [None]:
# clean_train_df = train_df.drop([
#     'STAT_CAUSE_DESCR',
#     'OWNER_DESCR',
#     'STATE',
#     'CLIMATE_REGION',
# ], axis=1)
# clean_train_df.info()

### Confusion Matrix Helper

Ref:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion Matrix',
                          cmap=quant_colormap.mpl_colormap):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar(shrink=0.65)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thresh =  cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')
    
    plt.tight_layout()
    # sklearn's confusion_matrix puts actual classes on rows (y) and predictions on columns (x)
    plt.xlabel('Predicted label')
    plt.ylabel('Actual label')
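
The `normalize` branch of the helper divides each row by its total, so every row sums to 1 and the diagonal becomes the share of each actual class that was predicted correctly. A worked example on a made-up 3x3 matrix, assuming the scikit-learn convention of actual classes on rows and predicted classes on columns:

```python
import numpy as np

# Hypothetical counts: rows are actual classes, columns are predicted classes
cm = np.array([
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 5],
])

# Reshape the row totals to (3, 1) so they broadcast across the columns
row_totals = cm.sum(axis=1)[:, np.newaxis]
cm_normalized = cm.astype('float') / row_totals

print(cm_normalized)
```

Note that a class with no actual samples has a row total of zero, which would produce a division warning and `nan` values in that row.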

## Create Benchmark

Before we train a model, let's establish a benchmark to work against. To do this, we will develop, by hand, a very simple decision tree.

The simplest benchmark that we can set is with a model that always evaluates to a single response. Our benchmark model will evaluate to `5.0`, the code representative of 'Debris Burning'.

TODO: this benchmark should be improved once we have engineered some new data into our dataset

In [None]:
source_fields = [
    'OWNER_CODE',
    'DISCOVERY_WEEK',
    'STATE_CODE'
]

target_fields = [
    'STAT_CAUSE_CODE'
]

def benchmark_model(df):
    # Always predict 5.0 ('Debris Burning'), aligned with the input's index
    return pd.DataFrame({'STAT_CAUSE_CODE_PREDICT': 5.0}, index=df.index)

prediction = benchmark_model(test_df)

Test the accuracy of our benchmark model. It should fall right around 23% - this is the rough proportion of fires caused by 'Debris Burning'.

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score  http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report

In [None]:
accuracy = sklearn.metrics.accuracy_score(
    test_df[target_fields],
    prediction
)

print('Accuracy: {} \n\n'.format(accuracy))

print(sklearn.metrics.classification_report(
    test_df[target_fields],
    prediction
))

confusion_matrix = sklearn.metrics.confusion_matrix(test_df[target_fields], prediction)

plt.figure()
plot_confusion_matrix(
    confusion_matrix,
    stat_cause_mapping.values,
    normalize=True,
    title='Normalized confusion matrix'
)
plt.show()
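
scikit-learn also ships a ready-made version of this kind of constant baseline, `DummyClassifier`. The sketch below is an alternative to the hand-written benchmark, not what this notebook uses, and the labels are made up: with `strategy='most_frequent'` the classifier learns the most common class from the training labels instead of having `5.0` hard-coded.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: code 5.0 is the most common class (6 of 10 rows)
y_train = np.array([5.0, 5.0, 5.0, 1.0, 7.0, 5.0, 5.0, 1.0, 5.0, 9.0])
y_test = np.array([5.0, 1.0, 5.0, 7.0])
X_train = np.zeros((len(y_train), 1))  # features are ignored by this strategy
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_prediction = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_prediction)
print(baseline_accuracy)
```

The accuracy of such a baseline is simply the share of the test set that belongs to the majority class, which is why the benchmark here is expected to land near the proportion of 'Debris Burning' fires.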

## Train Model

### Stochastic Gradient Descent (SGD) Classifier

[TKTK]


Ref:

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

http://scikit-learn.org/stable/modules/sgd.html#classification

In [None]:
from sklearn.linear_model import SGDClassifier

source_fields = [
    'OWNER_CODE',
    'DISCOVERY_WEEK',
    'STATE_CODE'
]

target_fields = [
    'STAT_CAUSE_CODE'
]

classifier = SGDClassifier(
    loss="hinge",
    penalty="l2",
    max_iter=5,
    tol=None,
)

classifier.fit(
    train_df[source_fields],
    train_df[target_fields]['STAT_CAUSE_CODE']
)

prediction = classifier.predict(
    test_df[source_fields]
)

accuracy = sklearn.metrics.accuracy_score(
    test_df[target_fields],
    prediction
)

print('Accuracy: {} \n\n'.format(accuracy))

print(sklearn.metrics.classification_report(
    test_df[target_fields],
    prediction
))

confusion_matrix = sklearn.metrics.confusion_matrix(test_df[target_fields], prediction)

plt.figure()
plot_confusion_matrix(
    confusion_matrix,
    stat_cause_mapping.values,
    normalize=True,
    title='Normalized confusion matrix'
)
plt.show()

### Decision Tree Classifier

Ref:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
from sklearn.tree import DecisionTreeClassifier

source_fields = [
    'OWNER_CODE',
    'DISCOVERY_WEEK',
    'STATE_CODE'
]

target_fields = [
    'STAT_CAUSE_CODE'
]

classifier = DecisionTreeClassifier(
    random_state=0
)

classifier.fit(
    train_df[source_fields],
    train_df[target_fields]
)

prediction = classifier.predict(
    test_df[source_fields]
)

accuracy = sklearn.metrics.accuracy_score(
    test_df[target_fields],
    prediction
)

print('Accuracy: {} \n\n'.format(accuracy))

print(sklearn.metrics.classification_report(
    test_df[target_fields],
    prediction
))

confusion_matrix = sklearn.metrics.confusion_matrix(test_df[target_fields], prediction)

plt.figure()
plot_confusion_matrix(
    confusion_matrix,
    stat_cause_mapping.values,
    normalize=True,
    title='Normalized confusion matrix'
)
plt.show()

## Decision Tree Classifier Redux

Ref:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

In [None]:
source_fields = [
    'OWNER_CODE',
#     'DISCOVERY_WEEK',
    'STATE_CODE'
]

target_fields = [
    'STAT_CAUSE_CODE'
]

cv_split = sklearn.model_selection.ShuffleSplit(
    n_splits=10,
    test_size=0.3,
    train_size=0.6,
    random_state=0
)
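
To make the splitter's behavior concrete, here is a small sketch on a made-up array of 100 rows (not the wildfire data). With `test_size=0.3` and `train_size=0.6`, each of the 10 shuffled splits uses 60% of the rows for training and 30% for testing, and leaves the remaining 10% out of that split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical toy data: 100 rows, one feature
X_toy = np.arange(100).reshape(-1, 1)

toy_split = ShuffleSplit(
    n_splits=10,
    test_size=0.3,
    train_size=0.6,
    random_state=0
)

split_sizes = []
for train_idx, test_idx in toy_split.split(X_toy):
    # Train and test indices never overlap within a split
    assert not set(train_idx) & set(test_idx)
    split_sizes.append((len(train_idx), len(test_idx)))

print(split_sizes)
```

Unlike k-fold, the splits are drawn independently, so the same row can show up in the test set of several splits.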

### Base Classifier

In [None]:
# min_impurity_split and presort are omitted: both were deprecated and later
# removed from scikit-learn, so passing them raises a TypeError on recent versions.
base_classifier = DecisionTreeClassifier(
    class_weight=None,
    criterion='gini',
    max_depth=10,
    max_features=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    min_samples_leaf=1,
    min_samples_split=2,
    min_weight_fraction_leaf=0.0,
    random_state=0,
    splitter='best'
)

# Score the base classifier across the shuffled splits defined above
base_results = sklearn.model_selection.cross_validate(
    base_classifier,
    train_df[source_fields],
    train_df[target_fields],
    cv=cv_split
)

base_classifier.fit(
    train_df[source_fields],
    train_df[target_fields],
)

In [None]:
print(base_classifier.tree_.node_count)

In [None]:
pprint.pprint(base_classifier.get_params())
print("Base Score: {}".format(base_results['test_score'].mean()))

In [None]:
dot_data = sklearn.tree.export_graphviz(
    base_classifier,
    out_file=None,
    feature_names=source_fields,
    class_names=True,
    filled=True,
    rounded=True
)
print(dot_data)
graph = graphviz.Source(dot_data, engine='sfdp')
graph

### Hyperparameter Tuning

Set up a hyperparameter grid for the decision tree. Using GridSearchCV, we will try every combination of the parameter values that we define and keep the one with the best cross-validated accuracy.
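
Grid search cost grows multiplicatively with each list of values, so it is worth counting the candidates before launching a fit. The sketch below uses `ParameterGrid` to count the combinations in a grid with the same values as the one defined in the next cell, assuming the 10-split `ShuffleSplit` from earlier in this section.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid in the next cell, repeated here so the sketch runs on its own
grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best"],
    "max_depth": [17, 19, 20, 22, 25],
    "min_samples_split": [20, 40, 60],
    "min_samples_leaf": [1, 2, 4, 8],
    "min_weight_fraction_leaf": [0],
    "max_features": [None],
    "random_state": [0],
    "max_leaf_nodes": [None],
    "min_impurity_decrease": [0.],
    "class_weight": [None]
}

n_candidates = len(ParameterGrid(grid))  # one candidate per combination of values
n_splits = 10                            # assumed: matches cv_split above
n_fits = n_candidates * n_splits         # excludes the final refit on all training data

print('{} candidates x {} splits = {} fits'.format(n_candidates, n_splits, n_fits))
```

Only four of the lists hold more than one value (2 x 5 x 3 x 4), so trimming any one of them is the quickest way to shorten the search.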

Ref:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter



In [None]:
hyperparameter_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best"],
    "max_depth": [17, 19, 20, 22, 25],
    "min_samples_split": [20, 40, 60],
    "min_samples_leaf": [1, 2, 4, 8],
    "min_weight_fraction_leaf": [0],
    "max_features": [None],
    "random_state": [0],
    "max_leaf_nodes": [None],
    "min_impurity_decrease": [0.],
    "class_weight": [None]
}

tuned_classifier = sklearn.model_selection.GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    hyperparameter_grid,
    scoring='accuracy',
    cv=cv_split
)

tuned_classifier.fit(
    train_df[source_fields],
    train_df[target_fields]
)

In [None]:
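pprint.pprint(tuned_classifier.best_params_)
print("Tuned Score: {}".format(tuned_classifier.best_score_))

# Predict with the tuned model (GridSearchCV refits the best estimator by default),
# not the earlier untuned `classifier`, which was trained on a different feature set.
tuned_prediction = tuned_classifier.predict(
    test_df[source_fields]
)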

tuned_accuracy = sklearn.metrics.accuracy_score(
    test_df[target_fields],
    tuned_prediction
)

print('Accuracy: {} \n\n'.format(tuned_accuracy))

print(sklearn.metrics.classification_report(
    test_df[target_fields],
    tuned_prediction
))

confusion_matrix = sklearn.metrics.confusion_matrix(test_df[target_fields], tuned_prediction)

plt.figure()
plot_confusion_matrix(
    confusion_matrix,
    stat_cause_mapping.values,
    normalize=True,
    title='Normalized confusion matrix'
)
plt.show()

### Feature Elimination

[TKTK]

Ref:




In [None]:
rfe_classifier = sklearn.feature_selection.RFECV(
    base_classifier,
    step=1,
    scoring='accuracy',
    cv=cv_split
)

rfe_classifier.fit(
    train_df[source_fields],
    train_df[target_fields]
)

rfe_support_columns = train_df[source_fields].columns.values[rfe_classifier.get_support()]

rfe_results = sklearn.model_selection.cross_validate(
    base_classifier,
    train_df[rfe_support_columns],
    train_df[target_fields],
    cv=cv_split
)

print(rfe_results)

rfe_tuned_classifier = sklearn.model_selection.GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    hyperparameter_grid,
    scoring='accuracy',
    cv=cv_split
)

rfe_tuned_classifier.fit(
    train_df[rfe_support_columns],
    train_df[target_fields]
)

print("RFE + Tuned Score: {}".format(rfe_tuned_classifier.best_score_))
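
To show what `RFECV` and `get_support()` produce, here is a self-contained sketch on synthetic data from `make_classification`. The dataset, the feature counts, and the shallow tree are all assumptions made for illustration; none of it comes from the wildfire data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 6 features, only 3 of which carry signal
X_toy, y_toy = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    random_state=0
)

toy_split = ShuffleSplit(n_splits=3, test_size=0.3, train_size=0.6, random_state=0)

# Recursively drop the least important feature, scoring each subset by cross-validation
selector = RFECV(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    step=1,
    scoring='accuracy',
    cv=toy_split
)
selector.fit(X_toy, y_toy)

# Boolean mask with one entry per input feature; True marks a kept feature
support_mask = selector.get_support()
X_selected = X_toy[:, support_mask]

print('Kept {} of {} features'.format(selector.n_features_, X_toy.shape[1]))
print(support_mask)
```

With only two or three source fields, as in this notebook, there is little for RFECV to eliminate; it should pay off more once additional engineered features are in the dataset.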

## Evaluate Model