<a href="https://colab.research.google.com/github/Ujustwaite/ml1/blob/master/Lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 2: Classification Using New York City Fire Department Data:**


### Team: Aditya Garapati, Chase Henderson, Brian Waite, Carl Walenciak

### Date: 10/24/2019




# General Description:

## Business Understanding:

This is a continuation of the analysis of the Fire Department of New York City (FDNY) data describing fire incidents in support of the New York Fire Incident Reporting System (NYFIRS). Prior to and during our initial data preparation, we identified a number of analytic questions that could be of interest to fire resource planners. 

New York City uses a series of alarm codes to identify the severity of a fire and the associated response. These alarm codes are described at the following web locations: 

http://www.fdnewyork.com/aa.asp

https://en.wikipedia.org/wiki/New_York_City_Fire_Department#Radio_and_bell_code_signals


## Problem Statement 1: Predicting High-Alarm vs. Low-Alarm Fires



Build a classification model that can **predict whether a fire constitutes a severe, high level incident, or a less-severe fire** based on parameters contained in the available data set. 

By building this classification model, we seek not only to identify the level of incidents based on parameters, but also to identify factors that contribute to an incident being classified as a high-level / low-level incident in order to aid planners in their future decision making process. 

### Approach: 

This analysis builds upon the work previously submitted in the Mini-Lab assignment. In addition to using logistic regression and support vector machine models, we expand our analysis to include a Random Forest Classifier. For each model we interpret the model parameters to understand the factors that contribute to the classification decision. 

For those models (Logistic Regression and SVM) that were used in the Mini-Lab, we will carry forward the optimization that we performed to gain best performance. This is typically the model with the best performance after performing a GridSearch on a variety of model parameters. Those "best fit" model parameters will be carried forward and run once here. 

For models not previously included in other assignments, the full GridSearch / optimization process is displayed and model performance evaluation according to selected criteria is described. 

## Problem Statement 2: To be determined

Insert Once Identified

### Approach: 

Insert Once Identified

# Data Preparation: 

## Data Description: 

A robust description of the data has been previously provided along with an associated Exploratory Data Analysis. That information is available here for reference: https://colab.research.google.com/github/Ujustwaite/ml1/blob/master/Playing_With_Fire.ipynb

## Data Cleaning: 

To optimize this analysis, we need to do some additional transformation of the data and some leftover housekeeping from our EDA. This includes: 

* Conversion of the presence of the Automatic Extinguisher System to either "not present = 0" or "present = 1". 

* Conversion of the presence of a fire detector to either "not present = 0" or "present = 1". 

* Conversion of the presence of a standpipe system to "not present = 0" or "present = 1". 

* Filling of missing `total_incident_duration` values with the mean value. **Note that this process has been updated to take place after the separation of the Train / Test data in order to not pollute the model with advance knowledge of the Test data set.**

* Correction of two incorrect zip code values. 

* Dropping of categorical fields that are freely input by the user and are unusable for analysis or are not consistently used in a meaningful sense for this problem. 

In [4]:
!pip install category_encoders



In [0]:
#Imports Section
import category_encoders as ce
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

#Read in the data output from EDA step
final_df = pd.read_pickle('final_df.pkl')


In [0]:
#DATA CLEANING BLOCK

#AES presence update
final_df.loc[final_df['aes_presence_desc'] == '1 - Present', 'aes_presence_desc'] = 1
final_df.loc[final_df['aes_presence_desc'] != 1, 'aes_presence_desc'] = 0

#Smoke Detector presence update
final_df.loc[final_df['detector_presence_desc'] == '1 - Present', 'detector_presence_desc'] = 1
final_df.loc[final_df['detector_presence_desc'] != 1, 'detector_presence_desc'] = 0

#Standpipe presence update
final_df.loc[final_df['standpipe_sys_present_flag'] == '1', 'standpipe_sys_present_flag'] = 1
final_df.loc[final_df['standpipe_sys_present_flag'] != 1, 'standpipe_sys_present_flag'] = 0

#Replacement of missing zip codes
#Identified values using google maps and intersection information
final_df.at[18695, 'zip_code'] = '11103'
final_df.at[18760, 'zip_code'] = '11357'

#Drop the categorical columns determined to be unusable
final_df = final_df.drop(columns = ['fire_spread_desc','floor','story_fire_origin_count'])

#Drop the remaining 19 values that are missing the highest level alarm description
#Note: dropna() without a subset drops a row if ANY of its columns is missing
final_df.dropna(inplace = True)
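
The `.loc` assignments above may leave the flag columns with `object` dtype even though every value is now 0 or 1. A minimal sketch of an alternative that yields integer columns directly is shown below. The toy frame and the `'N - None Present'` label are illustrative assumptions, not values confirmed from the FDNY data.

```python
import pandas as pd

# Toy frame standing in for final_df; the labels are illustrative only.
toy = pd.DataFrame({
    "aes_presence_desc": ["1 - Present", "N - None Present", None],
    "standpipe_sys_present_flag": ["1", None, "0"],
})

# Compare against the "present" label and cast the boolean result to int.
# Missing values compare as False, so they become 0, matching the cell above.
toy["aes_presence_desc"] = toy["aes_presence_desc"].eq("1 - Present").astype(int)
toy["standpipe_sys_present_flag"] = toy["standpipe_sys_present_flag"].eq("1").astype(int)

print(toy.dtypes)
```

Integer columns avoid surprises later when the frame is passed to scalers or estimators that expect numeric input.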

### Problem 1: Construction of a Target Variable: 

Because the target variable is not currently in the data set, we must construct it with the existing available data set. 

Here we will take the existing feature, `highest_level_desc` that defines the level of alarm raised for each incident in the data set and convert it to a binary value. The values contained in the data set are: 

In [7]:
final_df.highest_level_desc.unique()

array(['11 - First Alarm', '75 - All Hands Working',
       '1 - More than initial alarm, less than Signal 7-5',
       '7 - Signal 7-5', '2 - 2nd alarm', '0 - Initial alarm',
       '22 - Second Alarm', '5 - 5th alarm', '4 - 4th alarm',
       '3 - 3rd alarm', '55 - Fifth Alarm', '33 - Third Alarm',
       '44 - Fourth Alarm'], dtype=object)

As you might expect, the majority of the events occurring throughout the city are low-level fires.

In [8]:
grouped = final_df.groupby(['highest_level_desc'])
grouped.count().im_incident_key

highest_level_desc
0 - Initial alarm                                       19
1 - More than initial alarm, less than Signal 7-5    12963
11 - First Alarm                                     11873
2 - 2nd alarm                                           53
22 - Second Alarm                                       47
3 - 3rd alarm                                           13
33 - Third Alarm                                        11
4 - 4th alarm                                            4
44 - Fourth Alarm                                        5
5 - 5th alarm                                            5
55 - Fifth Alarm                                         5
7 - Signal 7-5                                         701
75 - All Hands Working                                 621
Name: im_incident_key, dtype: int64

This means the "in class" and "out of class" groups have the potential to be highly imbalanced. We'll address this later in our analysis. Based on the definitions of the alarms in the references provided and on our initial EDA, we define a severe fire as any incident at the second-alarm level or higher. We also include Signal 7-5 / All Hands Working (codes 7 and 75), which are not truly second alarm or higher, but help to balance the in-class data set. 

In [0]:
#Split the alarm code off the front of the description
new = final_df["highest_level_desc"].str.split(" ", n = 1, expand = True) 
#Convert to integer
new[0] = new[0].astype('int32')
#Map the classifications according to alarm code
desc = {0: 0, 1: 0, 11: 0, 2: 1, 22: 1, 3: 1, 33: 1, 4: 1, 44: 1, 5: 1, 55: 1, 7: 1, 75: 1}
final_df['FireLevel'] = [desc[item] for item in new[0]] 

In [10]:
#Number of in class records
final_df.FireLevel.sum()

1465

In [11]:
#Total number of records
final_df.FireLevel.shape[0]

26320

Our target variable is now contained in the data frame as `FireLevel`. As we can see, there are 1,465 in-class values out of the total 26,320 records, or approximately 5.6 percent. We will monitor and adjust for this imbalance throughout our analysis. 
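
As a quick cross-check of the class balance, the in-class share can be recomputed from the per-code counts in the grouped output above. This is only an arithmetic sketch: the counts are copied by hand from that output, and the set of severe codes mirrors the mapping cell.

```python
# Record counts per alarm code, copied from the grouped output above.
counts = {0: 19, 1: 12963, 11: 11873, 2: 53, 22: 47, 3: 13, 33: 11,
          4: 4, 44: 5, 5: 5, 55: 5, 7: 701, 75: 621}

# Codes mapped to the "in class" (severe) label in the mapping cell.
severe_codes = {2, 22, 3, 33, 4, 44, 5, 55, 7, 75}

n_total = sum(counts.values())
n_severe = sum(n for code, n in counts.items() if code in severe_codes)
share = n_severe / n_total

print(n_severe, n_total, f"{share:.1%}")  # 1465 26320 5.6%
```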

## Encoding of Categorical Predictors

In order to ensure the categorical values are providing balanced contributions to the model, we leverage one-hot encoding. This significantly increases the number of features. 

In [0]:
#Encode borough description
label = ce.OneHotEncoder(use_cat_names=True)
borough_label = label.fit_transform(final_df[['borough_desc']])

#Encode incident type
label = ce.OneHotEncoder(use_cat_names=True)
incident_type_label = label.fit_transform(final_df[['incident_type_desc']])

#Encode actiontaken 1 label -- the primary action taken by units onscene
label = ce.OneHotEncoder(use_cat_names=True)
action_taken_1_label = label.fit_transform(final_df[['action_taken1_desc']])
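
To illustrate how one-hot encoding grows the feature count, here is a small sketch on made-up data. It uses pandas' `get_dummies` instead of `category_encoders`, and the borough labels are placeholders, not the exact strings in the data set.

```python
import pandas as pd

# Toy column standing in for final_df[['borough_desc']]; labels are placeholders.
toy = pd.DataFrame({"borough_desc": ["Brooklyn", "Queens", "Brooklyn"]})

# One indicator column is created per distinct category.
encoded = pd.get_dummies(toy[["borough_desc"]], prefix="borough_desc", dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each row has exactly one indicator set to 1, so a feature with k distinct values contributes k columns to the model matrix.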

## Drop Columns Not to be Used in Analysis

A number of the columns in the dataframe are now duplicative of the encoded columns or are redundant / not useful. Things like zip_code, street_highway, and nearest_intersection are already captured in lat/long that we'll retain. Some of the date time information is already captured in the total_incident_duration feature. 

In [0]:
#drop the unnecessary columns
model_df = final_df.drop(columns = ['action_taken1_desc', 'action_taken2_desc', 'action_taken3_desc', 'borough_desc','fire_box', 'highest_level_desc', 'im_incident_key', 'incident_date_time', 'incident_type_desc', 'last_unit_cleared_date_time', 'property_use_desc', 'street_highway', 'zip_code', 'nearest_intersection','incident_code', 'incident_desc', 'DATE'])

In [0]:
#concatenate the encoded columns
model_df = pd.concat([model_df,incident_type_label, action_taken_1_label, borough_label], axis = 1)

In [0]:
#Move the target value to the end of the dataframe and rename as target
model_df['target'] = model_df['FireLevel']
model_df = model_df.drop(columns = 'FireLevel')

## Create Train / Test Split

To prepare for the analysis, we use a shuffled 80/20 Train-Test split with stratification, because of the small number of "in class" records described above. Records are randomly assigned to either split, with a set seed to enable replication of results, and the split is stratified so that the in-class records are represented in the same proportion in both the train and test sets. 

In [0]:
X_p1_train, X_p1_test, y_p1_train, y_p1_test = train_test_split(model_df.iloc[:,0:99], model_df.iloc[:,99], test_size=0.20, random_state=42, shuffle = True, stratify = model_df.iloc[:,99])
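
The split above selects features and target by position (`iloc[:,0:99]` and `iloc[:,99]`), which depends on the frame having exactly 100 columns with `target` last. A hedged sketch of a name-based equivalent on synthetic data follows. The frame `toy` and its columns `x1` and `x2` are hypothetical stand-ins for `model_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for model_df: two features and a 10% positive target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
toy["target"] = [1] * 20 + [0] * 180

# Select by column name so the split does not depend on column positions.
X = toy.drop(columns="target")
y = toy["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, shuffle=True, stratify=y)

# Stratification keeps the positive share the same in both splits.
print(y_train.mean(), y_test.mean())
```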

### Imputation of the 'total_incident_duration' feature: 

One error we made in the Mini-Lab was to pollute the training data by imputing on the entire data set prior to splitting into train and test. The code below corrects this by imputing `total_incident_duration` separately on the train and test sets, each with its own mean value. 

In [0]:
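#Total Incident Duration imputation
#Assign the result back: fillna(inplace = True) on a column selection may not update the frame
X_p1_train['total_incident_duration'] = X_p1_train['total_incident_duration'].fillna(X_p1_train['total_incident_duration'].mean())
X_p1_test['total_incident_duration'] = X_p1_test['total_incident_duration'].fillna(X_p1_test['total_incident_duration'].mean())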
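
A common alternative, sketched here under toy assumptions, is to compute the imputation statistic on the training set only and reuse it for the test set, so that the test set is treated the way unseen data would be. This is not what the cell above does. The frames `train` and `test` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy train / test frames; the durations are made up.
train = pd.DataFrame({"total_incident_duration": [100.0, np.nan, 300.0]})
test = pd.DataFrame({"total_incident_duration": [np.nan, 500.0]})

# Learn the fill value from the training data only.
train_mean = train["total_incident_duration"].mean()

# Apply the same training-set mean to both splits.
train["total_incident_duration"] = train["total_incident_duration"].fillna(train_mean)
test["total_incident_duration"] = test["total_incident_duration"].fillna(train_mean)

print(train_mean)
print(test["total_incident_duration"].tolist())
```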


### Upsampling to obtain balanced records

To address the large imbalance in the data set, we up-sample the "in-class" records in the training set, sampling with replacement until they are as numerous as the "out of class" records. The test set is left untouched. 

In [25]:
from sklearn.utils import resample

# concatenate our training data back together
X = pd.concat([X_p1_train, y_p1_train], axis=1)

# separate minority and majority classes
not_severe = X[X.target==0]
severe = X[X.target==1]

# upsample minority
severe_upsampled = resample(severe, replace=True, # sample with replacement
                          n_samples=len(not_severe), # match number in majority class
                          random_state=123) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([not_severe, severe_upsampled])

# check new class counts
upsampled.target.value_counts()

1    19884
0    19884
Name: target, dtype: int64

A similar effect can be obtained without resampling by passing `class_weight = 'balanced'` when creating the model objects, which reweights the classes instead of duplicating records, but we wanted to execute the up-sampling to understand the process for doing so. 

Reference: https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
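
For comparison with the up-sampling above, the sketch below reproduces the weighting that scikit-learn documents for `class_weight = 'balanced'`, namely `n_samples / (n_classes * np.bincount(y))`. The majority count of 19,884 comes from the output above. The minority count of 1,172 is an assumption inferred from 80% of the 1,465 in-class records and is not printed in the notebook.

```python
import numpy as np

# Training labels before up-sampling: 19,884 majority records (from the output
# above) and an assumed 1,172 minority records (80% of 1,465).
y = np.array([0] * 19884 + [1] * 1172)

n_samples = len(y)
n_classes = 2

# The "balanced" heuristic: weight each class inversely to its frequency.
weights = n_samples / (n_classes * np.bincount(y))

# Each class ends up with the same total weight.
print(weights)
print(weights * np.bincount(y))
```

With these weights the minority class counts about as much in the loss as the majority class, without duplicating any rows.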