# Preliminary fire incident data set builder (v.1.0)
### August 23, 2017

## Summary

This notebook includes code to build "fire_incident_master_20170823.csv", v.1.0 of the master fire incident data set.  Below is a dictionary of variable names and definitions.

The aim of v.1.0 is to have a preliminary cleaned dataset from which team members can begin to perform data analysis and model testing.  v.1.0 currently includes only a limited number of test variables from the tax data set.  Going forward, the goal will be to continue to add control variables as they become available.

Please note that, while some outlier observations have been removed, this data set could be cleaned further.  For example, there are some observations where num_bathrooms/bedrooms are very large and likely are input errors.
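
One way to surface such suspect rows for review, rather than dropping them outright, is sketched below. The toy data and the cutoff of 20 are assumptions for illustration only; they are not thresholds used in v.1.0.

```python
import pandas as pd

# Toy stand-in for the master data set (values are made up for illustration).
toy_df = pd.DataFrame({
    'EAS': [1.0, 2.0, 3.0, 4.0],
    'Num_Bathrooms': [1.0, 2.0, 150.0, 3.0],
    'Num_Bedrooms': [2.0, 3.0, 4.0, 99.0],
})

# Hypothetical cutoff: counts above this are treated as likely input errors.
MAX_PLAUSIBLE = 20

# Flag rows where either count exceeds the cutoff, so they can be inspected.
suspect = (toy_df[['Num_Bathrooms', 'Num_Bedrooms']] > MAX_PLAUSIBLE).any(axis=1)
print(toy_df[suspect])
```

Flagging instead of filtering keeps the decision about each row with the analyst, which may be preferable while the right cutoff is still unknown.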

You can find "fire_incident_master_20170823.csv" at: 


## Variable definitions

**EAS:** Building location code. The data set has been collapsed on EAS, since multiple fire incidents can occur at a given EAS between 2003 and 2017. Each observation is a unique location code, and there are no duplicates.  Float variable.
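
The no-duplicates property can be checked after loading the file. The snippet below is a minimal sketch on a made-up frame; `eas_df` is a hypothetical stand-in for the loaded master data set.

```python
import pandas as pd

# Hypothetical stand-in for the loaded master data set (EAS values are made up).
eas_df = pd.DataFrame({'EAS': [101.0, 102.0, 103.0]})

# True when every EAS value appears exactly once.
print(eas_df['EAS'].is_unique)

# Rows that share an EAS with at least one other row (empty when EAS is unique).
dupes = eas_df[eas_df['EAS'].duplicated(keep=False)]
print(len(dupes))
```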

**Latest_Incident_Date:**  The date of the latest fire-related incident at an EAS between 2003 and 2017.  If there has been no fire-related incident at an EAS between 2003 and 2017, observation is missing.  Pandas date-time variable.

**Latest_Incident_Year:**  The year of the latest fire-related incident at an EAS between 2003 and 2017.  If there has been no fire-related incident at an EAS between 2003 and 2017, observation is missing.  Float variable.

**Fire_Incident_Type:** Type of the most recent fire incident at a given EAS.  If there has been no fire-related incident at an EAS between 2003-2017, observation is missing.  String variable.

**Fire_Incident_Count:** Total number of fire incidents recorded for a given EAS between 2003-2017.  If no fire incidents occurred, observation is 0.  Float variable.
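
The four incident variables above describe a collapse from one row per incident to one row per EAS. A minimal sketch of one way to do that on toy data follows; the frame `log` and all of its values are made up for illustration.

```python
import pandas as pd

# Toy incident log: one row per incident (all values are made up).
log = pd.DataFrame({
    'EAS': [1.0, 1.0, 2.0],
    'Incident Date': pd.to_datetime(['2005-03-01', '2016-07-15', '2010-01-20']),
    'Fire_Incident_Type': ['TYPE A', 'TYPE B', 'TYPE C'],
})

# Keep the most recent incident per EAS, so date and type come from the same row.
latest = (log.sort_values('Incident Date')
             .drop_duplicates(subset='EAS', keep='last')
             .rename(columns={'Incident Date': 'Latest_Incident_Date'}))
latest['Latest_Incident_Year'] = latest['Latest_Incident_Date'].dt.year

# Count incidents per EAS and attach the count to the collapsed rows.
counts = log.groupby('EAS').size().reset_index(name='Fire_Incident_Count')
collapsed = latest.merge(counts, on='EAS').sort_values('EAS').reset_index(drop=True)
print(collapsed)
```

Sorting by date and keeping the last row per EAS ensures the incident type belongs to the latest incident, which a column-wise `max()` would not guarantee.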

**Yr_Property_Built:**  Year property was built.  Observations not between 1875 and 2017 have been removed.  Float variable.

**Num_Bathrooms:**  Number of bathrooms at property.  Float variable.

**Num_Bedrooms:**  Number of bedrooms at property.  Float variable.

**Num_Rooms:**  Number of rooms at property.  Float variable.

**Num_Stories:**  Number of building stories.  Observations > 100 have been removed.  Float variable.

**Num_Units:**  Number of units at property.  Float variable.

**Perc_Ownership:**  Percent ownership by single owner (?).  Float variable between 0 and 1.  Observations >1 removed.

**Land_Value:**  Estimated value of land.  Float variable.

**Property_Area:**  Area (in sq. ft.) of property.  Float variable.  Observations < 10 removed.

**Neighborhood:**  San Francisco neighborhood where property is located.  String variable.

**Location_y:**  Location coordinates of property.  String variable.

**Address:**  Address of property.  String variable.

**Property_Code_Desc:**  Type of building.  Each observation contains a property type code, which is a one or two-digit alphanumeric, in addition to the description of building type.  String variable.
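
For analyses that need the code and the description separately, the combined string can be split again. This is a sketch only: it assumes the code and description are joined by " - ", and the example values and the column names `Property_Code` and `Property_Desc` are hypothetical.

```python
import pandas as pd

# Toy values in the "<code> - <DESCRIPTION>" layout (codes and descriptions are made up).
toy = pd.DataFrame({'Property_Code_Desc': ['D - DWELLING', 'Z9 - EXAMPLE TYPE']})

# Split on the first ' - ' only, in case a description itself contains ' - '.
parts = toy['Property_Code_Desc'].str.split(' - ', n=1, expand=True)
toy['Property_Code'] = parts[0]
toy['Property_Desc'] = parts[1]
print(toy)
```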


## Code

In [None]:
#Load packages
import numpy as np
import pandas as pd
from datetime import datetime

path = 'C:\\Users\\Kevin\\Desktop\\Fire Risk\\Model_matched_to_EAS'

#Load EAS-matched fire incident data
incident_df = pd.read_csv(path + '\\' + 'matched_Fire_Incidents.csv', 
              low_memory=False)

#Load EAS-matched fire safety complaint data
complaint_df = pd.read_csv(path + '\\' + 'matched_Fire_Safety_Complaints.csv', 
              low_memory=False)

#load EAS-matched tax data
tax_df = pd.read_csv(path + '\\' + 'matched_EAS_Tax_Data.csv', 
              low_memory=False)

#Functions
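def standardize(data, col, drop, newname):
    """Build a lookup table mapping each 3-character code prefix in `col`
    to one upper-cased description, stored in a column called `newname`."""
    #Count each distinct value (most frequent first); naming the columns
    #explicitly keeps this working across pandas versions
    t = data[col].value_counts().rename_axis('index').reset_index(name='count')
    t['code'] = t['index'].str[0:3]
    #Drop the values listed in `drop`, then keep the most frequent description per code
    t = t[~t['index'].isin(drop)].drop_duplicates(subset='code', keep='first').copy()
    t['index'] = t['index'].str.upper().str.replace(' - ', ' ')
    t = t.rename(columns={'index': newname}).drop('count', axis=1)
    return t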

#Prepare fire incident data
incident_df = incident_df[['Incident Date',
                           'Primary Situation',
                           'EAS']].dropna()  #Drop where any variable NAN
incident_df['Incident Date'] = pd.to_datetime(incident_df['Incident Date'])  #To pd datetime
incident_df['Incident_Year'] = incident_df['Incident Date'].dt.year

#Standardize Primary Situation column
temp = standardize(incident_df, 'Primary Situation', 
                   ['10 -', '1 -', '11 -'], 
                   'Fire_Incident_Type')
incident_df['code'] = incident_df['Primary Situation'].apply(lambda s: s[0:3])
incident_df = pd.merge(incident_df, temp, on='code').dropna()
incident_df.drop(['Primary Situation', 'code'], axis=1, inplace=True)

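#Collapse data on EAS, keeping the row of the most recent incident at each EAS
#(a column-wise max() would pair the latest date with the alphabetically last incident type)
incident_collapse = incident_df.sort_values('Incident Date').drop_duplicates(subset='EAS', keep='last').reset_index(drop=True)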
#Count incidents per EAS (column names set explicitly so this works across pandas versions)
temp = incident_df.groupby('EAS').size().reset_index(name='Fire_Incident_Count')
incident_collapse = pd.merge(incident_collapse, temp, on='EAS')
incident_collapse.rename(columns = {'Incident Date': 'Latest_Incident_Date'}, inplace=True)
incident_collapse.rename(columns = {'Incident_Year': 'Latest_Incident_Year'}, inplace=True)

#Prepare tax_df data
tax_df = tax_df[['Year Property Built',
                 'Number of Bathrooms',
                 'Number of Bedrooms',
                 'Number of Rooms',
                 'Number of Stories',
                 'Number of Units',
                 'Percent of Ownership',
                 'Closed Roll Assessed Land Value',
                 'Property Area in Square Feet',
                 'EAS BaseID',
                 'Neighborhoods - Analysis Boundaries',
                 'Property Class Code',
                 'Property_Class_Code_Desc',
                 'Location_y',
                 'Address']].dropna()

tax_df.rename(columns = {'EAS BaseID': 'EAS'}, inplace=True)
tax_df.rename(columns = {'Neighborhoods - Analysis Boundaries': 'Neighborhood'}, inplace=True)

tax_collapse = tax_df.groupby('EAS').max().reset_index()
tax_collapse['Property_Class_Code_Desc'] = tax_collapse['Property_Class_Code_Desc'].apply(lambda x: x.upper())
tax_collapse['Neighborhood'] = tax_collapse['Neighborhood'].apply(lambda x: x.upper())
tax_collapse['Property_Code_Desc'] = tax_collapse['Property Class Code'] + ' - ' + tax_collapse['Property_Class_Code_Desc']
tax_collapse.drop(['Property_Class_Code_Desc', 'Property Class Code'], axis=1, inplace=True)

#Merge to master dataset
master_df = pd.merge(incident_collapse, tax_collapse, how='outer', on='EAS')
master_df['Fire_Incident_Count'] = master_df['Fire_Incident_Count'].fillna(0)  #No incident on record -> count of 0
master_df = master_df.dropna(subset=['Address'])

#Standardize var names
master_df.rename(columns = {'Year Property Built': 'Yr_Property_Built',
                            'Number of Bathrooms': 'Num_Bathrooms',
                            'Number of Bedrooms': 'Num_Bedrooms',
                            'Number of Rooms': 'Num_Rooms',
                            'Number of Stories': 'Num_Stories',
                            'Number of Units': 'Num_Units',
                            'Percent of Ownership': 'Perc_Ownership',
                            'Closed Roll Assessed Land Value': 'Land_Value',
                            'Property Area in Square Feet': 'Property_Area'}, inplace=True)

#Removing some obvious outliers/data errors
#Removing obs where year built is <1875 or >2017
master_df = master_df[master_df['Yr_Property_Built'] >= 1875]
master_df = master_df[master_df['Yr_Property_Built'] <= 2017 ]
#Remove where percent ownership > 1
master_df = master_df[master_df['Perc_Ownership'] <= 1]
#Remove where property area < 10
master_df = master_df[master_df['Property_Area'] >= 10]
#Remove where num_stories > 100
master_df = master_df[master_df['Num_Stories'] <= 100]

#Export data
master_df.to_csv(path_or_buf= path + '\\' + 'fire_incident_master_20170823.csv', index=False)
