# **Predicting Building Permit Issuance Times for San Francisco?**
## ...and answering many questions!


***A Data Science Project by Aparna Shastry***

## Content
+ Introduction
+ Data
+ What Are We Trying to Find Out?
+ Cleaning the Data
+ Exploratory Data Analysis
+ Drawing Inferences
+ Identification of features
+ Modeling
+ Predicting
+ Conclusions

## Introduction
A building permit is an official approval document issued by a governmental agency that allows you or your contractor to proceed with a construction or remodeling project on your property. For more details click [here](https://www.thespruce.com/what-is-a-building-permit-1398344). Each city or county has its own buildings office, which performs multiple functions such as issuing permits, inspecting buildings to enforce safety measures, and modifying rules to accommodate the needs of a growing population. For the city of San Francisco, permit issuance is handled by the [Permit Services wing of the Department of Building Inspection](http://sfdbi.org/permit-services) (henceforth called DBI).
Delays in permit issuance pose serious problems for the construction industry and, later on, for real estate agencies. See this [Trulia study](https://www.trulia.com/blog/trends/elasticity-2016/) and this [Vancouver city article](https://biv.com/article/2014/11/city-building-permit-delays-costing-developers-tim).

## Data
Data for this project is available on the San Francisco city open data portal. It is updated every Saturday.
1. Go to the [SF](https://data.sfgov.org/Housing-and-Buildings/Building-Permits/i98e-djp9/data) open data portal.
2. Click on Filter and "Add a Filter Condition".
3. A drop-down menu appears.
4. Select "Filed Date" and "is after".
5. I entered the date as 12/31/2012 because I wanted to analyze the last 4-5 years. I think the most recent data matters most in a problem like this: city council policies could change, there might be new rules, new employees to expedite the process, etc. Older data may not be very useful in modeling.
6. I chose to download in CSV format because the file is less than 100MB and easy to load into a notebook. Had it caused issues, I might have chosen a different format.

The file as of Feb 25, 2018 (Sunday) has been downloaded and kept for easy access. Its size is about 75MB.
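The portal filter in steps 4 and 5 can also be reproduced in pandas after download. The following is only a sketch: the rows are made up for illustration, and the `MM/DD/YYYY` date format is an assumption about how the portal exports "Filed Date".

```python
import io
import pandas as pd

# Tiny made-up sample standing in for the downloaded CSV (illustrative values only).
sample_csv = io.StringIO(
    "Permit Number,Filed Date\n"
    "A1,12/30/2012\n"
    "A2,01/02/2013\n"
    "A3,06/15/2017\n"
)

permits = pd.read_csv(sample_csv)
# Assumed date format; unparseable values become NaT instead of raising.
permits["Filed Date"] = pd.to_datetime(
    permits["Filed Date"], format="%m/%d/%Y", errors="coerce"
)

# Same condition as the portal filter: "Filed Date" is after 12/31/2012.
cutoff = pd.Timestamp("2012-12-31")
recent = permits[permits["Filed Date"] > cutoff]
print(recent["Permit Number"].tolist())
```

Filtering locally like this makes it easy to try a different cutoff date without downloading the file again.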

### What Are We Trying to Find Out?

The primary objective of this project is to model building permit issuance times using at least the last 3 years of permit filing and issuance data, so that the model can be used to predict issuance times for applications filed in the future. Beyond that, this work addresses a few concerns and answers a few questions that are of practical help in the construction industry. We will get to the specifics once we get an initial glimpse of the columns in the data.

In [1]:
import os
import pandas as pd
import numpy as np
import time
import datetime

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style(style='white')

from scipy.stats import pearsonr
import statsmodels.api as sm
from statsmodels.formula.api import ols

from sklearn.ensemble import RandomForestRegressor
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression



In [2]:
%%time
# Read and make a copy for speed
sf = pd.read_csv('../data/Building_Permits.csv',low_memory=False)
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198900 entries, 0 to 198899
Data columns (total 43 columns):
Permit Number                             198900 non-null object
Permit Type                               198900 non-null int64
Permit Type Definition                    198900 non-null object
Permit Creation Date                      198900 non-null object
Block                                     198900 non-null object
Lot                                       198900 non-null object
Street Number                             198900 non-null int64
Street Number Suffix                      2216 non-null object
Street Name                               198900 non-null object
Street Suffix                             196132 non-null object
Unit                                      29479 non-null float64
Unit Suffix                               1961 non-null object
Description                               198610 non-null object
Current Status                            198900 n

### What are We Trying to Find Out (Continued):

Specifically, we will be trying to answer the following in the next parts:

+ **Data Cleaning:**      
1) Which of these columns should we retain for further analysis? This is the first question to answer because eliminating obviously non-useful columns would save a lot of time.  
2) How should we interpret records with zero or very small values for the cost variables related to the construction?  
3) What do we do if dates map to a Saturday or Sunday? What about a date with invalid numbers for the month or the day of the month?  
4) Should we replace blanks in some columns where the only valid value is 'Y'?  
         
+ **Exploratory Data Analysis:**   
1) What is the best day of the week to visit DBI, to file an application form? Is the popular belief “mid-week (Wednesday) is the least crowded and hence best to visit government or city agencies” true in this case?    
2) Which types of permits are mostly issued on the same day as filing?  
3) Which types take the least average time to issue if not issued on the same day?  
4) Is there any particular quarter of each year which has higher application counts or average wait times? Can it be justified from business knowledge?  
5) Is the “Revised Cost” of a construction always more than “Estimated Cost”?     
6) Is there a correlation between the location and wait times?     

+ **Drawing Inferences:**         
1) Is there a statistically significant difference between issuance times for residential vs. commercial buildings?  
2) Is there a statistically significant difference between wait times of fire-only permits and all other permits? Similarly for site permits. What is the interpretation?  
3) Is there a statistically significant difference in wait time between high-cost and low-cost projects?  
4) Can we use an ANOVA test to verify variability in wait times across years, or variability in issue dates across weekdays?  

+ **Modeling:**     
1) What are the main factors influencing the building permit issuance times?         
2) How does the building estimated cost relate to wait times?         

+ **Predicting:**      
1) Predict the time to issue from the date of filing, based on the model chosen in the modeling stage.

In [3]:
# Conversion to datetime
import traceback
try:
    # errors='coerce' turns unparseable dates into NaT instead of raising
    for col in ['Filed Date', 'Issued Date', 'Permit Expiration Date']:
        sf[col] = pd.to_datetime(sf[col], errors='coerce')
except Exception:
    traceback.print_exc()

# Keep a copy to reload
sfcpy = sf.copy()

In [4]:
# Sometimes when re-run is required, one can start from just here, to save time
sf = sfcpy.copy()

In [5]:
# Rename for brevity/readability
sf = sf.rename(columns =   {'Neighborhoods - Analysis Boundaries':'neighborhoods',
                            'Permit Type' : 'perm_typ',
                            'Permit Type Definition': 'perm_typ_def',
                            'Filed Date':'file_dt',
                            'Issued Date':'issue_dt',
                            'Permit Expiration Date' : 'perm_exp_dt',
                            'Structural Notification':'strct_notif',
                            'Number of Existing Stories':'no_exist_stry',
                            'Number of Proposed Stories':'no_prop_stry',
                            'Fire Only Permit':'fire_only_permit',
                            'Estimated Cost':'est_cost',
                            'Revised Cost':'rev_cost',
                            'Existing Use':'exist_use',
                            'Proposed Use': 'prop_use',
                            'Plansets':'plansets',
                            'Existing Construction Type': 'exist_const_type',
                            'Existing Construction Type Description': 'exist_const_type_descr',
                            'Proposed Construction Type': 'prop_const_type',
                            'Proposed Construction Type Description': 'prop_const_type_descr',
                            'Site Permit':'site_permit',
                            'Supervisor District':'sup_dist',
                            'Location':'location'
                            })

In [6]:
sf.columns

Index(['Permit Number', 'perm_typ', 'perm_typ_def', 'Permit Creation Date',
       'Block', 'Lot', 'Street Number', 'Street Number Suffix', 'Street Name',
       'Street Suffix', 'Unit', 'Unit Suffix', 'Description', 'Current Status',
       'Current Status Date', 'file_dt', 'issue_dt', 'Completed Date',
       'First Construction Document Date', 'strct_notif', 'no_exist_stry',
       'no_prop_stry', 'Voluntary Soft-Story Retrofit', 'fire_only_permit',
       'perm_exp_dt', 'est_cost', 'rev_cost', 'exist_use', 'Existing Units',
       'prop_use', 'Proposed Units', 'plansets', 'TIDF Compliance',
       'exist_const_type', 'exist_const_type_descr', 'prop_const_type',
       'prop_const_type_descr', 'site_permit', 'sup_dist', 'neighborhoods',
       'Zipcode', 'location', 'Record ID'],
      dtype='object')

## Cleaning the Data

The tricky parts of data wrangling, in my experience so far, are:  

a) Knowing what is present in the 'null' cells: is it NaN or simply ' '?  
b) Checking whether all values in the non-null cells are meaningful.  
c) Recognizing that even if a column currently has only non-null, meaningful values, future updates to the column may introduce problems, so we need to anticipate and handle them.  
d) Looking at the data, judging whether certain values make sense for the business, and dropping those that are not relevant.
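Point (a) can be checked directly. The sketch below uses a hypothetical helper, `summarize_missing`, on a made-up toy frame; it counts true NaNs and whitespace-only strings separately for each column. It is one possible way to do the check, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Count true NaNs and whitespace-only strings per column (hypothetical helper)."""
    nan_counts = df.isna().sum()
    blank_counts = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
    )
    return pd.DataFrame({"n_nan": nan_counts, "n_blank": blank_counts})

# Made-up toy data: one NaN and one blank string in 'site_permit'.
toy = pd.DataFrame({
    "site_permit": ["Y", np.nan, " ", "Y"],
    "est_cost": [1000.0, np.nan, 0.0, 25000.0],
})
summary = summarize_missing(toy)
print(summary)
```

A column with a non-zero `n_blank` would need its blanks converted to NaN (or to an explicit category) before `fillna` can catch them.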

##### Answering the questions:
1) The following columns are retained for further analysis. More could be added after discussion.

In [7]:
sfr = sf[['perm_typ','file_dt','issue_dt','perm_exp_dt','strct_notif','no_exist_stry','no_prop_stry','fire_only_permit','est_cost','rev_cost',
          'exist_use','prop_use','plansets','exist_const_type','prop_const_type','site_permit','neighborhoods',
          'sup_dist','location']]

2) The cost of the project is an essential part of the application according to [this post](http://www.herald-journal.com/housing/pages/older/permit.html). Hence, if either of the two cost-related columns has 0 or an unusually small number, the best option is to treat those rows as outliers and drop them.

In [8]:
# Drop too small values, we choose threshold 10. This is subjective. Could drop more in EDA.
sfr = sfr.loc[sfr['est_cost'] > 10,:]
sfr = sfr.loc[sfr['rev_cost'] > 10,:]

3) I would attribute a weekend date to a typing mistake and move a Saturday to the previous day (Friday) and a Sunday to the next day (Monday). This may not be accurate, but it ensures that no weekend shows up in the EDA part.   

In [9]:
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same day names
sfr['file_day'] = sfr['file_dt'].dt.day_name()
sfr['issue_day'] = sfr['issue_dt'].dt.day_name()
sfr.loc[sfr['file_day']=='Saturday','file_dt']  = sfr.loc[sfr['file_day']=='Saturday','file_dt'] - datetime.timedelta(1)
sfr.loc[sfr['file_day']=='Saturday','file_day'] = 'Friday'
sfr.loc[sfr['file_day']=='Sunday','file_dt'] =  sfr.loc[sfr['file_day']=='Sunday','file_dt'] + datetime.timedelta(1)
sfr.loc[sfr['file_day']=='Sunday','file_day'] = 'Monday'
sfr.loc[sfr['issue_day']=='Saturday','issue_dt'] = sfr.loc[sfr['issue_day']=='Saturday','issue_dt'] - datetime.timedelta(1)
sfr.loc[sfr['issue_day']=='Saturday','issue_day'] = 'Friday'
sfr.loc[sfr['issue_day']=='Sunday','issue_dt'] = sfr.loc[sfr['issue_day']=='Sunday','issue_dt'] + datetime.timedelta(1)
sfr.loc[sfr['issue_day']=='Sunday','issue_day'] = 'Monday'
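The same weekend rule (Saturday moves back to Friday, Sunday moves forward to Monday) can be written once as a helper and applied to any date column. This is a sketch under that rule; `shift_weekend_to_weekday` is a hypothetical name and the dates are toy values.

```python
import pandas as pd

def shift_weekend_to_weekday(dates):
    """Move Saturdays back one day (Friday) and Sundays forward one day (Monday)."""
    dow = dates.dt.dayofweek  # Monday=0 ... Saturday=5, Sunday=6; NaN for NaT
    shifted = dates.mask(dow == 5, dates - pd.Timedelta(days=1))
    shifted = shifted.mask(dow == 6, dates + pd.Timedelta(days=1))
    return shifted

# Friday, Saturday, Sunday, and a missing date.
toy_dates = pd.Series(pd.to_datetime(["2018-02-23", "2018-02-24", "2018-02-25", None]))
fixed = shift_weekend_to_weekday(toy_dates)
print(fixed.dt.day_name().tolist())
```

Missing dates pass through unchanged, and the day-name column can then be derived from the corrected dates in a single step.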

4) In application forms (whether physical or online), the applicant normally ticks an option only if it applies; otherwise nothing needs to be done. Hence it is reasonable to read blanks as not applicable, i.e. a "No".

In [10]:
# Fill NaN with 'N': in building permit databases, the box is ticked if yes and left blank if not applicable.
# Assign the result back instead of using inplace=True on a column, which may not modify the frame.
sfr['fire_only_permit'] = sfr['fire_only_permit'].fillna('N')
sfr['site_permit'] = sfr['site_permit'].fillna('N')
sfr['strct_notif'] = sfr['strct_notif'].fillna('N')

In [11]:
sfr['time_taken'] = (sfr['issue_dt'] - sfr['file_dt']).dt.days
sfr['valid_period'] = (sfr['perm_exp_dt'] - sfr['issue_dt']).dt.days

# We are interested in only finished ones for prediction of time
sfr = sfr[sfr['time_taken'].notna()]

In [12]:
sfr['month'] = sfr['file_dt'].dt.month
sfr['year'] = sfr['file_dt'].dt.year

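There is no need to check the month or day values separately: because the datetime conversion used `errors='coerce'`, an invalid filing or issue date becomes NaT, and such rows are dropped above when `time_taken` is missing.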
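A quick way to see how `pd.to_datetime` treats an invalid month or day under `errors='coerce'` is to run it on a few made-up strings. The explicit `format` here is an assumption for the illustration.

```python
import pandas as pd

# One valid date, one with an impossible month and day, one non-date string.
raw = pd.Series(["02/25/2018", "13/45/2018", "not a date"])
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Invalid entries come back as NaT rather than raising an exception.
print(parsed.isna().tolist())
```

Counting `parsed.isna()` before and after conversion is a simple way to find out how many dates were silently coerced.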

In [13]:
# Fill NaNs in construction types with a placeholder category code
sfr['exist_const_type'] = sfr['exist_const_type'].fillna(9)
sfr['prop_const_type'] = sfr['prop_const_type'].fillna(9)

# Number of stories: replace NaNs with a placeholder value
sfr['no_exist_stry'] = sfr['no_exist_stry'].fillna(200)
sfr['no_prop_stry'] = sfr['no_prop_stry'].fillna(200)

# Turn NaNs into a placeholder category code for plansets
sfr['plansets'] = sfr['plansets'].fillna(10).astype(int)

In [14]:
# Fill missing locations with '(0,0)' to indicate that the location is unknown
sfr['location'] = sfr['location'].fillna('(0,0)')

# Convert each '(x,y)' string into a float array [x, y]
sfr.location = sfr.location.apply(lambda x: np.array([float((str(x).split('(')[1]).split(',')[0]),float((str(x).split('(')[1]).split(',')[1].split(')')[0])]))
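The one-line lambda above can be hard to read. A named helper doing the same string parsing might look like the sketch below. `parse_location` is a hypothetical name, the coordinates are made up, and the (latitude, longitude) order is an assumption about the column.

```python
import numpy as np

def parse_location(text):
    """Turn a string like '(37.78, -122.42)' into a float array of two values."""
    first, second = str(text).strip().strip("()").split(",")
    return np.array([float(first), float(second)])

coords = parse_location("(37.7858, -122.4228)")  # assumed (latitude, longitude)
unknown = parse_location("(0,0)")                # placeholder for a missing location
print(coords, unknown)
```

It could be applied with `sfr['location'].apply(parse_location)`, which keeps the parsing logic testable on its own.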

In [15]:
# Replace NaNs with the string "Unknown"
sfr['exist_use'] = sfr['exist_use'].fillna('Unknown')
sfr['prop_use'] = sfr['prop_use'].fillna('Unknown')

# Some more variables
sfr['neighborhoods'] = sfr['neighborhoods'].fillna('Unknown')
sfr['sup_dist'] = sfr['sup_dist'].fillna('Unknown')

# Taking care to ensure that future data will not have Nans
# Currently none

In [16]:
sfr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 130636 entries, 0 to 198896
Data columns (total 25 columns):
perm_typ            130636 non-null int64
file_dt             130636 non-null datetime64[ns]
issue_dt            130636 non-null datetime64[ns]
perm_exp_dt         130635 non-null datetime64[ns]
strct_notif         130636 non-null object
no_exist_stry       130636 non-null float64
no_prop_stry        130636 non-null float64
fire_only_permit    130636 non-null object
est_cost            130636 non-null float64
rev_cost            130636 non-null float64
exist_use           130636 non-null object
prop_use            130636 non-null object
plansets            130636 non-null int32
exist_const_type    130636 non-null float64
prop_const_type     130636 non-null float64
site_permit         130636 non-null object
neighborhoods       130636 non-null object
sup_dist            130636 non-null object
location            130636 non-null object
file_day            130636 non-null object
i

In [17]:
# Save the clean file
sfr.to_csv('../data/building_permit_clean.csv',index=False)

Comments on the Data Wrangling: All essential variables for predicting wait times have been cleaned up.