In [2]:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, MultiPoint
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection

# Predicting building demolition risk in Philadelphia's residential neighbourhoods, 2018-2021

## Introduction and Literature Review

### Predictors of demolition risk

Proximity to the city centre or to public transport stops increases a property's risk of demolition (Weber et al., 2006). 

Both a neighbourhood's _status_ at the start of the period for analysis (in this case, 2018) and its prior _change_ are predictive of a property's demolition risk (Weber et al., 2006). 


## Presentation of Data
### Data collection
Data on demolitions and on property-level and neighbourhood-level characteristics were collected from various local and federal government sources. Given the large number of properties in the dataset (over 500,000) and the number of features attached to each property (42), the data collection code takes a considerable amount of time to run. This section therefore omits the full code and instead outlines the data collection process in detail. The code described in this section can be found in a separate notebook on [GitHub](https://github.com/caranvr/DSSS-predicting-demolition/blob/main/data-collection-code.ipynb).

#### Demolitions

The City of Philadelphia's Department of Licenses and Inspections maintains a database of all demolition permits issued in the city since 2007. Only the addresses associated with private demolition permits from 2018-2021 were extracted, because public demolitions reflect structural issues more than consumer demand (Weber et al., 2006).

#### Property characteristics

The City of Philadelphia's Office of Property Assessment (OPA) maintains a regularly updated, georeferenced database of city properties, which contains the building characteristics used to assess properties for tax purposes. Because the dataset is large, it was accessed through an API call. Based on the property characteristics found by Weber et al. (2006) to predict demolition risk, the following (non-identifying) features were selected:
- **interior_condition**: a numeric code representing the quality of a property's interior. Properties are rated on a scale of 2 to 7, with 2 corresponding to "noticeably new construction" and 7 corresponding to "structurally compromised" (City of Philadelphia, 2021). A value of 0 indicates vacant land. 
- **exterior_condition**: a numeric code representing the property's external appearance, using the same scale as *interior_condition*. 
- **total_area**: the total area of the property.
- **year_built**: the year the property was built.

To match properties to Census tracts and find distances to relevant attractions, the latitude and longitude of each property were requested as well. In addition, **category_code_description** was selected, because Weber et al. (2006) examined only single-family residential demolitions, and commercial, industrial, or multi-family residential properties may have different predictors of demolition risk.

The most comprehensive property dataset is the most recent one, which means that some assessments postdate a demolition. For 86.3% of properties with an associated demolition permit, however, the assessment on file predates the demolition, or the building has not yet been demolished. The remaining properties have been re-assessed since their demolition and are now listed as vacant land. Because the model includes building characteristics, which are null for properties classed as 'Vacant Land', all properties in this category were dropped from the dataset. The remaining properties were matched against the demolition dataset to flag whether a demolition permit was attached to each of them between 2018 and 2021.
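
The filtering and matching step described above might look roughly like the following. This is a minimal sketch on made-up data, not the code from the data collection notebook: the `address` and `permit_date` column names, the example addresses, and the matching on exact address strings are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature datasets; addresses and dates are invented.
properties = pd.DataFrame({
    "location": ["1 EXAMPLE ST", "2 EXAMPLE ST", "3 EXAMPLE ST"],
    "category_code_description": ["Single Family", "Vacant Land", "Single Family"],
})
permits = pd.DataFrame({
    "address": ["3 EXAMPLE ST", "9 EXAMPLE ST"],  # assumed column name
    "permit_date": pd.to_datetime(["2019-05-01", "2016-03-15"]),  # assumed column name
})

# Drop vacant land, whose building characteristics are null.
buildings = properties[properties["category_code_description"] != "Vacant Land"].copy()

# Keep only permits issued during the study period (2018-2021).
in_period = permits["permit_date"].between(
    pd.Timestamp("2018-01-01"), pd.Timestamp("2021-12-31")
)
demolished_addresses = set(permits.loc[in_period, "address"])

# 1 if a demolition permit from the study period matches the property, else 0.
buildings["demolition"] = buildings["location"].isin(demolished_addresses).astype(int)
print(buildings)
```

Matching on a stable identifier such as a parcel number, where one is available in both datasets, would likely be more robust than matching on address strings.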

A separate OPA dataset was used to find each property's market value in 2018, then joined to the main properties dataset. 

#### Distance attributes

In line with Ding and Hwang (2016), City Hall was used as a proxy for the city centre. The geopandas library was used to calculate the distance from each property to City Hall. 
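
As a rough illustration of this distance calculation, the sketch below substitutes a plain NumPy haversine (great-circle) formula for the geopandas workflow used in the analysis, so that it runs without any geospatial dependencies. The City Hall coordinates are approximate and are an assumption, as is the `haversine_km` helper name; the first test point is the first property shown in the processed dataset.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088      # mean Earth radius
CITY_HALL = (39.9524, -75.1636)  # approximate (lat, lng); an assumption


def haversine_km(lat, lng, lat0, lng0):
    """Great-circle distance in km from (lat, lng) to (lat0, lng0), in degrees."""
    lat, lng, lat0, lng0 = map(np.radians, (lat, lng, lat0, lng0))
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat) * np.cos(lat0) * np.sin((lng - lng0) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# 108 Wharton St (from the dataset) and City Hall itself.
lat = np.array([39.931278, 39.9524])
lng = np.array([-75.146866, -75.1636])

dist_city_hall_km = haversine_km(lat, lng, *CITY_HALL)
print(dist_city_hall_km)
```

With geopandas, the equivalent approach would be to reproject both the properties and City Hall to a projected coordinate reference system before measuring distance, since distances computed directly on latitude and longitude are in degrees rather than metres.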

Finding the distance from each property to its closest public transport stop was more involved, as there is no single shapefile of all public transport stops in the city. Shapefiles of commuter rail, subway, and trolley stops were therefore downloaded and concatenated into one dataframe. The stop coordinates were then indexed in a k-d tree using scipy's cKDTree class, which allows fast nearest-neighbour queries.
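
A nearest-stop lookup of this kind can be sketched as follows. The coordinates are hypothetical and assumed to be in a projected coordinate system measured in metres, so that Euclidean distance is meaningful; the variable names are illustrative only.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical projected coordinates (x, y) in metres.
stops_xy = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 2000.0]])
properties_xy = np.array([[100.0, 0.0], [900.0, 300.0], [50.0, 1900.0]])

# Build the tree once over all stops, then query every property against it.
tree = cKDTree(stops_xy)
dist_to_stop, nearest_stop = tree.query(properties_xy, k=1)

print(dist_to_stop)   # distance from each property to its closest stop
print(nearest_stop)   # row index of that stop in stops_xy
```

Building the tree once and querying all properties in a single call avoids computing the full property-by-stop distance matrix, which matters with over 500,000 properties.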

#### Neighbourhood attributes

For the purposes of this analysis, neighbourhoods were defined as U.S. Census tracts, which have an average population of 4,000 residents (Oka and Wong, 2016). For status variables, 



### Data cleaning

In [3]:
#Load the processed dataset in chunks, then combine the chunks into one dataframe
chunked_df = pd.read_csv('geo_properties_full.csv', chunksize=40000)
df = pd.concat(chunked_df)

In [6]:
df.head()

| | parcel_number | lng | lat | location | category_code_description | interior_condition | exterior_condition | total_area | year_built | demolition | ... | POP00 | NHWHT00 | NHBLK00 | HISP00 | HU00 | OWN00 | RENT00 | AG25UP00 | COL00 | HINC00 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11000600 | -75.146866 | 39.931278 | 108 WHARTON ST | Single Family | 5.0 | 5.0 | 779.0 | 1920 | 0 | ... | 23.553159 | 31.344902 | -19.275697 | -0.944922 | 11.214148 | 23.594292 | -8.473036 | 57.000401 | 38.637068 | 80.777254 |
| 1 | 11000700 | -75.146921 | 39.931286 | 110 WHARTON ST | Single Family | 2.0 | 2.0 | 779.1 | 1920 | 0 | ... | 23.553159 | 31.344902 | -19.275697 | -0.944922 | 11.214148 | 23.594292 | -8.473036 | 57.000401 | 38.637068 | 80.777254 |
| 2 | 11000800 | -75.146971 | 39.931292 | 112 WHARTON ST | Single Family | 4.0 | 4.0 | 725.2 | 1920 | 0 | ... | 23.553159 | 31.344902 | -19.275697 | -0.944922 | 11.214148 | 23.594292 | -8.473036 | 57.000401 | 38.637068 | 80.777254 |
| 3 | 11000900 | -75.147034 | 39.93123 | 114 WHARTON ST | Single Family | 4.0 | 4.0 | 1433.0 | 1920 | 0 | ... | 23.553159 | 31.344902 | -19.275697 | -0.944922 | 11.214148 | 23.594292 | -8.473036 | 57.000401 | 38.637068 | 80.777254 |
| 4 | 11001000 | -75.147087 | 39.931236 | 116 WHARTON ST | Single Family | 4.0 | 4.0 | 1500.0 | 1920 | 0 | ... | 23.553159 | 31.344902 | -19.275697 | -0.944922 | 11.214148 | 23.594292 | -8.473036 | 57.000401 | 38.637068 | 80.777254 |


In [8]:
#Inspect COL00 values in descending order
df['COL00'].sort_values(ascending=False)

77085    91.304687
77396    91.304687
77398    91.304687
77399    91.304687
77400    91.304687
           ...    
69754     0.383896
69753     0.383896
69752     0.383896
69751     0.383896
69481     0.383896
Name: COL00, Length: 535734, dtype: float64

In [13]:
df.loc[[77085, 77396, 77398, 77399, 77400],['COL18', 'COL00', 'location']]

| | COL18 | COL00 | location |
|---|---|---|---|
| 77085 | 93.481595 | 91.304687 | 2018-32 WALNUT ST |
| 77396 | 93.481595 | 91.304687 | 1806-18 RITTENHOUSE SQ |
| 77398 | 93.481595 | 91.304687 | 1806-18 RITTENHOUSE SQ |
| 77399 | 93.481595 | 91.304687 | 1806-18 RITTENHOUSE SQ |
| 77400 | 93.481595 | 91.304687 | 1806-18 RITTENHOUSE SQ |


In [None]:
#Check for columns where most values are null, then drop

In [None]:
#Inspect columns for invalid values

In [None]:
#Convert 2000 columns to the change from 2000 to 2018

#Lookup (2018 column -> 2000 column) for fields converted to a percentage point difference
pc_diff = {
    'B03002_003E': 'NHWHT00',
    'B03002_004E': 'NHBLK00',
    'B03001_003E': 'HISP00',
    'B25003_002E': 'OWN00',
    'B25003_003E': 'RENT00',
    'COL18': 'COL00'
}

#Lookup (2018 column -> 2000 column) for fields converted to a percentage change
pc_change = {
    'B01001_001E': 'POP00', #total population
    'B25003_001E': 'HU00', #total housing units
    'B15002_001E': 'AG25UP00', #25+ population
    'B19013_001E': 'HINC00' #median household income
}

#Overwrite each 2000 column with the percentage point difference (2018 minus 2000)
for col_2018, col_2000 in pc_diff.items():
    df[col_2000] = df[col_2018] - df[col_2000]

#Overwrite each 2000 column with the percentage change relative to 2000
#A zero baseline in 2000 would give inf, so it is treated as missing instead
for col_2018, col_2000 in pc_change.items():
    baseline = df[col_2000].replace(0, np.nan)
    df[col_2000] = ((df[col_2018] - baseline)/baseline)*100

In [None]:
#Check rows where most values are null, then drop

In [None]:
#Convert categorical features into dummies using sklearn DictVectorizer

### Summary statistics

In [None]:
#Correlation matrix

In [None]:
#Histograms of continuous variables

## Methodology

In [None]:
#Write function to tune hyperparameters for each classifier with GridSearchCV
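
A helper of this kind might be sketched as follows. The function name `tune_classifier`, the F1 scoring choice, the five-fold cross-validation, and the small parameter grid are all assumptions rather than settings from the analysis, and the data here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def tune_classifier(estimator, param_grid, X, y, scoring="f1", cv=5):
    """Grid-search param_grid for estimator; return the refit best model and its score."""
    search = GridSearchCV(estimator, param_grid, scoring=scoring, cv=cv)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, search.best_score_


# Synthetic, imbalanced stand-in for the demolition data (illustration only).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.8, 0.2], random_state=0
)

grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
best_model, best_params, best_score = tune_classifier(
    RandomForestClassifier(random_state=0), grid, X_demo, y_demo
)
print(best_params, best_score)
```

Because the function takes the estimator and grid as arguments, the same helper could be reused for the k-Nearest Neighbours classifier with a different grid. A scoring metric other than accuracy is suggested here on the assumption that demolitions are a small minority of properties.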

### Random Forest classifier

In [None]:
#Tune hyperparameters using GridSearchCV

### k-Nearest Neighbours classifier

In [None]:
#Tune hyperparameters using GridSearchCV

## Results and Discussion

In [None]:
#Metrics for each classifier (table)

In [None]:
#Confusion matrix for each classifier (figure)

In [None]:
#Map of false positives for each classifier

In [None]:
#Map of false negatives for each classifier

## Conclusion

## References