# Generating Fire Safety Complaint Features
## Introduction
This notebook documents the process of generating feature data from the file `matched_Fire_Safety_Complaints.csv`.

## Loading the File

In [3]:
import pandas

PATH_TO_CSV = "./data/matched_Fire_Safety_Complaints.csv"
complaints = pandas.read_csv(PATH_TO_CSV)

## Examining and Grouping the Complaint Item Types
Using Python's `Counter`, tally every complaint type in this dataset and the number of complaints of each type.

In [4]:
from collections import Counter

complaint_types = Counter(complaints["Complaint Item Type Description"])
complaint_types

Counter({nan: 2,
         'fire escape': 122,
         'blocked exits': 3593,
         'crisp complaint inspection': 32,
         'unlicensed auto repair': 116,
         'refused hood + duct service': 49,
         'sprinkler/standpipe systems': 995,
         'roof access': 194,
         'ul cert verification': 445,
         'unapproved place of assembly': 12,
         'uncategorized complaint': 9838,
         'elevators not working': 217,
         'multiple fire code violations': 620,
         'overcrowded place of assembly': 243,
         'street numbering': 141,
         'exit maintenance': 147,
         'extinguishers': 1611,
         'electrical systems': 128,
         'weeds and grass': 634,
         'operating without a permit': 577,
         'hoarding': 108,
         'alarm systems': 9683,
         'leaking underground tanks': 4,
         'illegal occupancy': 37,
         'open vacant building': 112,
         'general hazardous materials': 343,
         'combustible materials': 

The totals show two complaints with the type "nan", which appear to be complaints for which no type was recorded. We can drop these two.
Several other complaint types relate to the safety of evacuation during a fire (e.g. "blocked exits") rather than to the risk of a fire starting. (We may drop these?)
Furthermore, we may consider grouping similar complaint types.
First, we may group many of the complaints into three categories: "potential fire cause", "potential fire control", and "fire emergency safety".
* Potential Fire Cause: 
    * general hazardous materials
    * leaking underground tanks
    * hoarding
    * combustible materials
    * weeds and grass
    * open vacant building
    * refused hood + duct service
    * electrical systems
    * unlicensed auto repair
* Potential Fire Control:
    * alarm systems
    * sprinkler/standpipe systems
    * extinguishers
* Fire Emergency Safety
    * roof access
    * unapproved place of assembly
    * fire escape
    * blocked exits
    * exit maintenance
    * street numbering
    * overcrowded place of assembly
    
In addition, I'm not exactly sure how to categorize the following complaint types:
* ul cert verification
* crisp complaint inspection
* uncategorized complaint
* multiple fire code violations
* illegal occupancy
* operating without a permit
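
The grouping above can be expressed as a lookup from description to category. The following is a minimal sketch, not the code used later in this notebook: the names `GROUPS`, `CATEGORY_BY_DESCRIPTION`, and `category_totals` are hypothetical, the "ungrouped" label is an assumption, and the counts are a handful copied from the `Counter` output above rather than the full dataset.

```python
from collections import Counter

# Hypothetical lookup built from the three groups listed above.
GROUPS = {
    "potential fire cause": [
        "general hazardous materials", "leaking underground tanks", "hoarding",
        "combustible materials", "weeds and grass", "open vacant building",
        "refused hood + duct service", "electrical systems",
        "unlicensed auto repair",
    ],
    "potential fire control": [
        "alarm systems", "sprinkler/standpipe systems", "extinguishers",
    ],
    "fire emergency safety": [
        "roof access", "unapproved place of assembly", "fire escape",
        "blocked exits", "exit maintenance", "street numbering",
        "overcrowded place of assembly",
    ],
}
# Reverse the grouping: description -> category.
CATEGORY_BY_DESCRIPTION = {d: c for c, ds in GROUPS.items() for d in ds}

# A few counts copied from the Counter output above (not the full dataset).
sample_counts = Counter({
    "blocked exits": 3593,
    "alarm systems": 9683,
    "extinguishers": 1611,
    "hoarding": 108,
    "ul cert verification": 445,
})

# Sum the counts per category; types outside the three groups are
# collected under an assumed "ungrouped" label.
category_totals = Counter()
for description, n in sample_counts.items():
    category = CATEGORY_BY_DESCRIPTION.get(description, "ungrouped")
    category_totals[category] += n
```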

## The Disposition Column

The values that occur in this column, and how often each occurs, can be listed as follows:

In [5]:
disposition_types = Counter(complaints["Disposition"])
disposition_types

Counter({nan: 2355,
         'referred to another agency': 360,
         'no jurisdiction': 199,
         'violation issued': 5627,
         'referred to pm inspection task force': 19,
         'no access to building': 51,
         'condition corrected': 17001,
         'referred to dph': 12,
         'duplicate complaint': 84,
         'referred to dbi': 225,
         'no merit': 5526})

There are 11 distinct disposition values, counting NaN. The values 'no merit' and 'duplicate complaint' suggest that the complaint is not valid, so we ignore such complaints in the final output.
There are also 2355 complaints whose disposition is NaN. For now, these complaints are kept and counted in the final output; because their disposition is not 'condition corrected', they also count as not corrected.
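
As an illustration of this rule, the sketch below filters a small made-up frame with a boolean mask. The sample rows and the names `INVALID_DISPOSITIONS`, `is_valid`, and `valid` are assumptions for the example. Because `isin` returns `False` for missing values, rows with a NaN disposition are kept, matching the decision above.

```python
import pandas

# Made-up sample rows; only the "Disposition" column name comes from the dataset.
sample = pandas.DataFrame({
    "Disposition": [
        "no merit",
        "condition corrected",
        None,                    # missing disposition
        "duplicate complaint",
        "violation issued",
    ]
})

# Dispositions treated as marking an invalid complaint.
INVALID_DISPOSITIONS = ["no merit", "duplicate complaint"]

# isin() is False for missing values, so NaN rows pass the filter.
is_valid = ~sample["Disposition"].isin(INVALID_DISPOSITIONS)
valid = sample[is_valid]
```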

## Generating Output

First, define a few functions to check the values of each complaint:

In [5]:
from datetime import date

def is_valid_complaint(row):
    disposition = row["Disposition"]
    return not (disposition == "no merit" or disposition == "duplicate complaint")

def is_corrected(row):
    disposition = row["Disposition"]
    return disposition == "condition corrected"

def parse_date(date_str):
    """For a string in the format of YYYY-MM-DD, 
    return (YYYY, MM, DD)"""
    return tuple(map(int, date_str.split('-')))

def is_within_date_range(row, min_date_str, max_date_str):
    """Check whether min_date <= row["Received Date"] <= max_date (inclusive).
    row: a row in the dataset, representing one complaint
    min_date_str: a str (YYYY-MM-DD) representing the beginning of the date range
    max_date_str: a str (YYYY-MM-DD) representing the end of the date range
    """
    complaint_date = date(*parse_date(row["Received Date"]))
    min_date = date(*parse_date(min_date_str))
    max_date = date(*parse_date(max_date_str))

    return min_date <= complaint_date <= max_date
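
The same inclusive range check can also be done on a whole column at once. This is a sketch on made-up dates, assuming "Received Date" holds `YYYY-MM-DD` strings as the `parse_date` docstring states; `received` and `in_range` are hypothetical names.

```python
import pandas

# Made-up dates around the boundaries of the range used later in the notebook.
sample = pandas.DataFrame({
    "Received Date": [
        "2004-12-31", "2005-01-01", "2010-06-15", "2016-12-31", "2017-01-01",
    ]
})

# Parse the strings once, then compare the whole column.
received = pandas.to_datetime(sample["Received Date"], format="%Y-%m-%d")
in_range = received.between(
    pandas.Timestamp("2005-01-01"),
    pandas.Timestamp("2016-12-31"),
)  # both bounds are inclusive by default
```

Parsing once per column avoids re-parsing the two range bounds for every row, which the row-wise function does.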

Next, map each complaint description to its complaint item type ID, then map those IDs to the more general complaint categories, following the previous observation.

In [6]:
# get the mapping from Complaint Item Type Description to Complaint Item Type
complaint_id_mapping = {}

for i, r in complaints.iterrows():
    dsc = r["Complaint Item Type Description"]
    complaint_id = r["Complaint Item Type"]
    if dsc in complaint_id_mapping:
        if complaint_id_mapping[dsc] != complaint_id:
            raise Exception("Complaint Type has different IDs")
    else:
        complaint_id_mapping[dsc] = complaint_id

complaint_id_mapping

{nan: 'unk',
 'fire escape': '22',
 'blocked exits': '02',
 'crisp complaint inspection': '16',
 'unlicensed auto repair': '04',
 'unapproved place of assembly': '24',
 'sprinkler/standpipe systems': '19',
 'roof access': '03',
 'ul cert verification': '17',
 'refused hood + duct service': '12',
 'uncategorized complaint': '99',
 'elevators not working': '09',
 'multiple fire code violations': '98',
 'overcrowded place of assembly': '11',
 'street numbering': '21',
 'alarm systems': '05',
 'extinguishers': '06',
 'electrical systems': '20',
 'weeds and grass': '01',
 'operating without a permit': '08',
 'hoarding': '18',
 'exit maintenance': '23',
 'leaking underground tanks': '13',
 'illegal occupancy': '25',
 'open vacant building': '07',
 'general hazardous materials': '15',
 'combustible materials': '10'}
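
The consistency check in the loop above (each description has exactly one ID) can also be written without `iterrows`. This is a sketch on made-up rows; `pairs` and `mapping` are hypothetical names, and only the two column names come from the dataset.

```python
import pandas

DESC = "Complaint Item Type Description"
TYPE_ID = "Complaint Item Type"

# Made-up sample rows using the dataset's column names.
sample = pandas.DataFrame({
    DESC: ["fire escape", "blocked exits", "fire escape", "hoarding"],
    TYPE_ID: ["22", "02", "22", "18"],
})

# Keep one row per distinct (description, ID) pair.
pairs = sample[[DESC, TYPE_ID]].drop_duplicates()

# A description that still appears twice must have two different IDs.
if pairs[DESC].duplicated().any():
    raise ValueError("Complaint Type has different IDs")

mapping = dict(zip(pairs[DESC], pairs[TYPE_ID]))
```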

In [7]:
# define mapping from complaint item type to category
potential_fire_cause = "potential fire cause"
potential_fire_control = "potential fire control"
fire_emergency_safety = "fire emergency safety"
multiple_violations = "multiple violations"

complaint_category_mapping = {potential_fire_cause: ['15', '13', '18', '10', '01', '07', '12', '20', '04'],
                              potential_fire_control: ['05', '19', '06'],
                              fire_emergency_safety: ['03', '24', '22', '02', '23', '21', '11']}
# reverse the mapping to get id -> category mappings
complaint_category_mapping = {d:c for c, d_list in complaint_category_mapping.items()
                                  for d in d_list}

Now, we're able to generate the output dataset.

In [28]:
from collections import defaultdict
from math import isnan

eas_to_features = defaultdict(lambda: defaultdict(float))

for d, r in complaints.iterrows():
    eas = r["EAS"]
    complaint_type = r["Complaint Item Type"]
    if not isnan(eas) and is_within_date_range(r, "2005-01-01", "2016-12-31"):
        features = eas_to_features[int(eas)]
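        # increment count features for generalized complaint types; only valid
        # complaints whose type falls in one of the three categories are counted,
        # so "count all complaints" excludes the uncategorized types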
        if complaint_type in complaint_category_mapping and is_valid_complaint(r):
            feature_name = "count {}".format(complaint_category_mapping[complaint_type])
            features[feature_name] += 1
            features["count all complaints"] += 1
        
            # increment count features for generalized complaint types not corrected:
            if not is_corrected(r):
                feature_name = "count {} not corrected".format(complaint_category_mapping[complaint_type])
                features[feature_name] += 1
                features["count all complaints not corrected"] += 1
        
        # count for each complaint type, maybe remove this?
        #complaint_type_dsc = r["Complaint Item Type Description"]
        #if is_valid_complaint(r):
        #    feature_name = "count {}".format(complaint_type_dsc)
        #    features[feature_name] += 1
        #    
        #    if not is_corrected(r):
        #        feature_name = "count {} not corrected".format(complaint_type_dsc)
        #        features[feature_name] += 1
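
For comparison, the per-address counts built by the loop above could also be produced with `pandas.crosstab`. This is a sketch under assumptions, not the notebook's method: the sample rows are made up, and the `category` and `corrected` columns are hypothetical, standing in for the results of `complaint_category_mapping` and `is_corrected` applied to rows that have already been filtered for validity and date range.

```python
import pandas

# Made-up rows; "category" and "corrected" are hypothetical precomputed columns.
sample = pandas.DataFrame({
    "EAS": [100.0, 100.0, 100.0, 200.0, None],
    "category": [
        "potential fire control", "potential fire control",
        "fire emergency safety", "potential fire cause",
        "potential fire cause",
    ],
    "corrected": [True, False, True, False, False],
})

# Drop rows without an EAS, as the loop does with isnan().
rows = sample.dropna(subset=["EAS"])

# One row per EAS, one "count <category>" column per category.
counts = pandas.crosstab(rows["EAS"].astype(int), rows["category"]).add_prefix("count ")
counts["count all complaints"] = counts.sum(axis=1)

# Same tabulation restricted to complaints that were not corrected.
open_rows = rows[~rows["corrected"]]
not_corrected = (
    pandas.crosstab(open_rows["EAS"].astype(int), open_rows["category"])
    .add_prefix("count ")
    .add_suffix(" not corrected")
)

# Addresses or categories with no uncorrected complaints get 0.
features = counts.join(not_corrected, how="left").fillna(0)
```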

In [29]:
df = pandas.DataFrame.from_dict(eas_to_features, orient='index', dtype=float)
df.fillna(0, inplace=True)
df

| EAS | count all complaints | count potential fire control | count all complaints not corrected | count potential fire control not corrected | count fire emergency safety | count potential fire cause | count fire emergency safety not corrected | count potential fire cause not corrected |
|---|---|---|---|---|---|---|---|---|
| 274495 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 274499 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 274504 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 274512 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 274546 | 9.0 | 8.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 274549 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 274552 | 3.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 274555 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 274557 | 5.0 | 2.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 274562 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
