In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, MultiPoint
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection

# Predicting building demolition risk in Philadelphia's residential neighbourhoods, 2018-2021
## Introduction and Literature Review

### Predictors of demolition risk

Proximity to the city centre or to public transport stops increases a property's risk of demolition (Weber et al., 2006). 

Both a neighbourhood's _status_ at the start of the period for analysis (in this case, 2018) and its prior _change_ are predictive of a property's demolition risk (Weber et al., 2006). 

## Presentation of Data
### Data collection
Data on demolitions, property-level characteristics, and neighbourhood-level characteristics were collected from various local and federal government sources. Given the large number of properties in the dataset (over 500,000) and the number of features attached to each property (42), executing the data collection code takes a considerable amount of time. This section therefore omits the full code and instead outlines the data collection process in detail. The code described in this section can be found in a separate notebook on [GitHub](https://github.com/caranvr/DSSS-predicting-demolition/blob/main/data-collection-code.ipynb).

#### Demolitions

The City of Philadelphia's Licenses and Inspections Department maintains a database of all demolition permits issued in the city since 2007. Only the addresses associated with private demolition permits from 2018-2021 were extracted, as public demolitions are more reflective of structural issues than of consumer demand (Weber et al., 2006). 

#### Property characteristics

The City of Philadelphia's Office of Property Assessments (OPA) maintains a regularly updated, georeferenced database of city properties, which contains building characteristics used to assess property tax rates. Due to the large size of the dataset, this database was accessed through an API call. Based on property characteristics found by Weber et al. (2006) to predict demolition risk, the following (non-identifying) features were selected:
- **interior_condition**: a numeric code representing the quality of a property's interior. Properties are rated on a scale of 1 to 7, with 1 corresponding to new construction and 7 corresponding to "structurally compromised" (City of Philadelphia, 2021). A value of 0 indicates vacant land. 
- **exterior_condition**: a numeric code representing the property's external appearance, using the same scale as *interior_condition*. 
- **total_area**: the total area of the property.
- **year_built**: the year the property was built.

To match properties to Census tracts and find distances to relevant attractions, the latitude and longitude of each property were requested as well. In addition, **category_code_description** was selected, as Weber et al. (2006) only looked at single-family residential property demolition. Commercial, industrial, or multi-family residential properties may have different predictors of demolition risk. 

The most recent property dataset is also the most comprehensive, but it may record a property's state after demolition. For 86.3% of properties with an associated demolition permit, the property assessment on file predates demolition, or the building has not yet been demolished. The remaining properties have been re-assessed since their demolition and are now listed as vacant land. Because the model includes building characteristics, which are null for properties classed as 'Vacant Land', all properties in this category were dropped from the dataset. The remaining properties were matched against the demolition dataset to identify whether a demolition permit was attached to them after 2018. 

A separate OPA dataset was used to find each property's market value in 2018, then joined to the main properties dataset. 

#### Distance attributes

In line with Ding and Hwang (2016), City Hall was used as a proxy for the city centre. The geopandas library was used to calculate the distance from each property to City Hall. 
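
The notebook computed this distance with geopandas. As a dependency-light illustration of the same idea, the sketch below swaps in a great-circle (haversine) approximation using NumPy only. The City Hall coordinates and the helper name `haversine_m` are assumptions made for this example, not values taken from the data collection notebook.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

# Approximate coordinates of Philadelphia City Hall (assumed for illustration)
CITY_HALL_LAT, CITY_HALL_LNG = 39.9526, -75.1635

def haversine_m(lat, lng, ref_lat=CITY_HALL_LAT, ref_lng=CITY_HALL_LNG):
    """Great-circle distance in metres from (lat, lng) to a reference point."""
    lat, lng = np.radians(lat), np.radians(lng)
    ref_lat, ref_lng = np.radians(ref_lat), np.radians(ref_lng)
    a = (np.sin((lat - ref_lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((lng - ref_lng) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

# Coordinates of two properties from the sample rows printed by df.head()
lats = np.array([39.931445, 39.930988])
lngs = np.array([-75.14854, -75.147067])
dist_city_hall = haversine_m(lats, lngs)
```

A projected coordinate system, as typically used with geopandas, would give planar distances instead; at city scale the two approaches should differ only slightly.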

Finding the distance from each property to its closest public transport stop was more involved, as there is no single shapefile of all public transport stops in the city. Shapefiles of commuter rail, subway, and trolley stops were therefore downloaded and concatenated into one dataframe. The stop coordinates were then indexed in a k-d tree using scipy's cKDTree class, which allowed for fast nearest-neighbour queries.
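
The nearest-stop lookup can be sketched as follows. This is a minimal illustration, not the code from the data collection notebook: the coordinates are hypothetical and are assumed to be in a projected coordinate system measured in metres, since the Euclidean distances returned by the tree are not meaningful for raw latitude and longitude.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical projected coordinates in metres (x, y); not real stop locations
stops_xy = np.array([[0.0, 0.0], [1000.0, 0.0], [0.0, 2000.0]])
props_xy = np.array([[100.0, 0.0], [900.0, 300.0], [50.0, 1500.0]])

# Build the k-d tree once over all stops
tree = cKDTree(stops_xy)

# For each property, get the distance to and the index of the nearest stop
dist_to_transport, nearest_stop = tree.query(props_xy, k=1)
```

Building the tree once and querying all properties in a single call avoids computing the full property-by-stop distance matrix.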

#### Neighbourhood attributes

Due to data availability, neighbourhoods were defined as U.S. Census tracts, which have an average population of 4,000 residents (Weber et al., 2006; Oka and Wong, 2016). The U.S. Census American Community Survey (ACS) 2013-18 5-Year Estimates were used for 2018 demographic variables, while the Longitudinal Tract Database (LTDB) — which matches 2000 Census data to post-2010 Census tract boundaries (Logan et al., 2014) — was used for the same variables in 2000. 

For comparison to the LTDB variables, ACS columns related to higher education were summed. This was done during the data collection process to reduce the size of the full dataset. Properties were then spatially matched to Census tracts using the geopandas library. 

### Data cleaning

In [2]:
#Load in processed dataset in chunks to save memory
chunks = []
chunked_df = pd.read_csv('geo_props_final.csv', chunksize=40000)

for chunk in chunked_df:
    chunks.append(chunk)

df = pd.concat(chunks)

In [3]:
df.set_index('parcel_number', inplace=True)
df.head()

parcel_number,lng,lat,location,category_code_description,interior_condition,exterior_condition,total_area,year_built,demolition,market_value,...,POP00,NHWHT00,NHBLK00,HISP00,HU00,OWN00,RENT00,AG25UP00,COL00,HINC00
11001660,-75.14854,39.931445,222 WHARTON ST,Single Family,4.0,4.0,1622.7,1960,0,212700.0,...,3715.000108,1958.888794,809.169495,331.332611,1700.323242,808.19928,635.013733,2356.681885,317.74942,36532.250861
11001670,-75.148604,39.931452,224 WHARTON ST,Single Family,4.0,4.0,1624.5,1960,0,212800.0,...,3715.000108,1958.888794,809.169495,331.332611,1700.323242,808.19928,635.013733,2356.681885,317.74942,36532.250861
11001680,-75.148668,39.931463,226 WHARTON ST,Single Family,4.0,4.0,1627.2,1960,0,212800.0,...,3715.000108,1958.888794,809.169495,331.332611,1700.323242,808.19928,635.013733,2356.681885,317.74942,36532.250861
11001690,-75.148729,39.93147,228 WHARTON ST,Single Family,4.0,4.0,1683.9,1960,0,215000.0,...,3715.000108,1958.888794,809.169495,331.332611,1700.323242,808.19928,635.013733,2356.681885,317.74942,36532.250861
11003500,-75.147067,39.930988,108 SEARS ST,Single Family,4.0,4.0,426.56,1920,0,140800.0,...,3715.000108,1958.888794,809.169495,331.332611,1700.323242,808.19928,635.013733,2356.681885,317.74942,36532.250861


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 534304 entries, 11001660 to 882150800
Data columns (total 36 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   lng                        534304 non-null  float64
 1   lat                        534304 non-null  float64
 2   location                   534304 non-null  object 
 3   category_code_description  534304 non-null  object 
 4   interior_condition         533936 non-null  float64
 5   exterior_condition         534253 non-null  float64
 6   total_area                 534304 non-null  float64
 7   year_built                 534304 non-null  int64  
 8   demolition                 534304 non-null  int64  
 9   market_value               532712 non-null  float64
 10  geometry                   534304 non-null  object 
 11  dist_city_hall             534304 non-null  float64
 12  dist_to_transport          534304 non-null  float64
 13  index_right        

Columns from the 2013-18 ACS were renamed for easier interpretation.

In [5]:
df.rename(columns={
    'B01001_001E': 'POP18',
    'B03002_003E': 'NHWHT18',
    'B03002_004E': 'NHBLK18',
    'B03001_003E': 'HISP18',
    'B25003_001E': 'HU18',
    'B25003_002E': 'OWN18',
    'B25003_003E': 'RENT18',
    'B15002_001E': 'AG25UP18',
    'B19013_001E': 'HINC18'
}, inplace=True)

Categorical features, namely the nominal category_code_description and the ordinal interior_condition and exterior_condition codes, were converted to category data types for later one-hot encoding. 

In [6]:
cat_cols = ['category_code_description', 'interior_condition', 'exterior_condition']

for c in cat_cols:
    if c == 'category_code_description':
        df[c] = df[c].astype('category')
    else:
        df[c] = df[c].astype(pd.UInt16Dtype()).astype('category')

Columns were then inspected for null values, which are not accepted by the classification methods used for this analysis.

In [7]:
#Check for columns where most values are null
total = df.isnull().sum().sort_values(ascending=False)
percent = ((df.isnull().sum()/df.isnull().count())*100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data[:5]

Column,Total,Percent
market_value,1592,0.297958
interior_condition,368,0.068875
HINC18,71,0.013288
exterior_condition,51,0.009545
HINC00,0,0.0


Only four columns have missing data, and the percentage of rows in each column with missing values is small. Missing values can therefore be dropped or imputed without a significant impact on model bias.

Given that interior_condition and exterior_condition are ordinal scales, the only option for imputing missing values in these columns is mode imputation. Mode imputation for these features may be inaccurate for individual properties, even if modal values from the same Census tract rather than the whole city were used, as buildings can vary significantly even within neighbourhoods. Thus, properties with missing values for these columns were dropped.

In [8]:
missing_bldgs = df.loc[
    (df['interior_condition'].isnull()) |
    (df['exterior_condition'].isnull())
]

print(f'{(missing_bldgs.shape[0]/df.shape[0])*100:.2f}% of properties will be dropped')

0.07% of properties will be dropped


In [9]:
df.drop(index=missing_bldgs.index, inplace=True)

The 2018 market value of an individual property _is_ related to the market value of neighbouring properties. Therefore, null values could be imputed with the median value of properties in the same Census tract. However, market value is also a function of property-level characteristics, which would not be taken into consideration during imputation. Properties with missing market values were thus dropped.

In [10]:
missing_values = df.loc[
    df['market_value'].isnull()
]

print(f'{(missing_values.shape[0]/df.shape[0])*100:.2f}% of properties will be dropped')

0.29% of properties will be dropped


In [11]:
df.drop(index=missing_values.index, inplace=True)
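
For reference, the tract-level median imputation that was considered but not adopted could look roughly like the sketch below. The toy dataframe, its tract codes, and the `market_value_imputed` column are hypothetical and serve only to illustrate the technique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; tract codes and values are illustrative only
toy = pd.DataFrame({
    'TRACTCE10': [100, 100, 100, 200, 200],
    'market_value': [100000.0, 140000.0, np.nan, 300000.0, np.nan],
})

# Median market value within each tract (nulls ignored), broadcast to every row
tract_median = toy.groupby('TRACTCE10')['market_value'].transform('median')

# Fill each missing value with the median of its own tract
toy['market_value_imputed'] = toy['market_value'].fillna(tract_median)
```

This fills every gap with a single neighbourhood-level figure, which is exactly why it ignores the property-level drivers of market value noted above.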

All properties missing a 2018 median household income fall within just two Census tracts, for which no 2013-18 median income estimate is available:

In [12]:
pd.DataFrame(df.loc[
    df['HINC18'].isnull()
].groupby('TRACTCE10').size().sort_values(ascending=False)).rename(columns={0: 'Missing values'})

TRACTCE10,Missing values
989100,39
980100,32


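Given that only 71 properties in two Census tracts were affected, missing values were imputed with the median of the tract-level median incomes attached to the remaining properties, used here as a stand-in for the median income in Philadelphia. 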

In [13]:
df['HINC18'] = df['HINC18'].fillna(df['HINC18'].median())

Lastly, the dataset was checked for invalid values in the 'year_built' column. Properties with a value of 0 in this column were dropped.

In [14]:
missing_years = df.loc[
    df['year_built'] == 0
]

print(f'{(missing_years.shape[0]/df.shape[0])*100:.2f}% of properties will be dropped')

0.33% of properties will be dropped


In [15]:
df.drop(index=missing_years.index, inplace=True)

In [16]:
df.shape

(530680, 36)

#### Feature transformation

For comparability between Census tracts, raw counts of residents or households in each demographic group were converted to percentages of a total, either of residents or housing units. (In line with U.S. Census practices, the number of residents with a Bachelor's degree or above was calculated as the proportion of residents _over 25_ with a Bachelor's degree or above.) 

In [17]:
#Convert selected features to percentages
def convert_to_pc(df, cols, denominator):
    for c in cols:
        df[c] = (df[c]/df[denominator])*100
    return df

pc_cols = {
    'POP00': ['NHWHT00', 'NHBLK00', 'HISP00'], #key is denominator, value is list of columns to divide by denominator
    'HU00': ['OWN00', 'RENT00'],
    'AG25UP00': ['COL00'],
    'POP18': ['NHWHT18', 'NHBLK18', 'HISP18'],
    'HU18': ['OWN18', 'RENT18'],
    'AG25UP18': ['COL18']
}

for k,v in pc_cols.items():
    df = convert_to_pc(df,v,k)

To measure change in neighbourhood characteristics, columns corresponding to 2000 were replaced with the change in tract values from 2000 to 2018: the percentage point difference for percentage columns, and the percentage change for count columns and median household income.

In [18]:
#Convert 2000 columns to percentage change from 2000-2018

#Lookup for fields encoding percentage point difference from 2000 to 2018
pc_diff = {
    'NHWHT18': 'NHWHT00',
    'NHBLK18': 'NHBLK00',
    'HISP18': 'HISP00',
    'OWN18': 'OWN00',
    'RENT18': 'RENT00',
    'COL18': 'COL00'
}

#Lookup for fields encoding percentage change from 2000 to 2018 (counts and median income)
pc_change = {
    'POP18': 'POP00', #total population
    'HU18': 'HU00', #total housing units
    'AG25UP18': 'AG25UP00', #25+ population
    'HINC18': 'HINC00' #median household income
}

#Replace 2000 columns with the percentage change or difference from 2000 to 2018
for k,v in pc_diff.items():
    df[v] = df[k] - df[v]

for k,v in pc_change.items():
    df[v] = ((df[k] - df[v])/df[v])*100
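
The two measures of change differ in their denominators. A small worked example with hypothetical tract values (not taken from the dataset) illustrates the distinction:

```python
import pandas as pd

# Hypothetical tract values, for illustration only
tract = pd.DataFrame({
    'OWN00': [60.0], 'OWN18': [45.0],      # % owner-occupied housing units
    'POP00': [4000.0], 'POP18': [5000.0],  # total population
})

# Percentage point difference: subtract one percentage from the other
own_pp_diff = tract['OWN18'] - tract['OWN00']

# Percentage change: difference relative to the 2000 baseline
pop_pc_change = (tract['POP18'] - tract['POP00']) / tract['POP00'] * 100
```

Here owner occupancy falls by 15 percentage points while population grows by 25%. Note that a tract with a 2000 baseline of zero would produce an infinite or null percentage change under this formula, so such values would need handling before modelling.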

To enable classification, categorical features were converted into dummy variables. 

In [19]:
id_cols = ['lng', 'lat', 'geometry', 'location', 'index_right', 'TRACTCE10', 'NAME10']
X_cols = [c for c in df.columns.values if c not in id_cols and c != 'demolition']

X = df[X_cols]
Y = df[['demolition']]

In [20]:
X = pd.get_dummies(X)

In [21]:
X.shape

(530680, 46)
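
The effect of `pd.get_dummies` on category columns can be seen on a miniature example. The toy table below is hypothetical and only mirrors the structure of the real feature matrix:

```python
import pandas as pd

# Hypothetical miniature feature table, for illustration only
toy_X = pd.DataFrame({
    'total_area': [1622.7, 426.56, 980.0],
    'interior_condition': pd.Series([4, 4, 7]).astype('category'),
    'category_code_description': pd.Series(
        ['Single Family', 'Single Family', 'Multi Family']).astype('category'),
})

# Numeric columns pass through; each category level becomes its own column
toy_dummies = pd.get_dummies(toy_X)
print(list(toy_dummies.columns))
```

Each level of a categorical feature becomes a separate indicator column, which is why the feature matrix grows to 46 columns. One-hot encoding also discards the ordering of the condition codes, treating each code as an unrelated level.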

### Summary statistics

In [None]:
#Correlation matrix

In [None]:
#Histograms of continuous variables

## Methodology

In [None]:
#Split into training and testing set

In [None]:
#Write function to tune hyperparameters for each classifier with GridSearchCV

### Random Forest classifier

In [None]:
#Tune hyperparameters using GridSearchCV

In [None]:
#Fit classifier

### k-Nearest Neighbours classifier

In [None]:
#Tune hyperparameters using GridSearchCV

In [None]:
#Fit classifier

## Results and Discussion

In [None]:
#Metrics for each classifier (table)

In [None]:
#Confusion matrix for each classifier (figure)

In [None]:
#Map of false positives for each classifier

In [None]:
#Map of false negatives for each classifier

## Conclusion

## References