# Detroit Blight

# Introduction

## Defining the Problem

Urban blight is one of those nebulous concepts where everyone can picture it, but no one can define it exactly.  [One paper (Morckel, 2014)](http://www.thecyberhood.net/documents/papers/cd2014.pdf) found over 20 different definitions of the phenomenon.  The Department of Housing and Urban Development defines a structure as blighted when  ["...it exhibits objectively determinable signs of deterioration sufficient to constitute a threat to human health, safety, and public welfare."](https://www.huduser.gov/portal/glossary/glossary_all.html#b)  Most of the definitions agree a property can be considered blighted if:
- The property is owned by the local government, typically through a lien after tax foreclosure.
- The property is in disrepair, potentially to a dangerous degree.
- The property is or may be demolished.

We will use these three ideas as our working description of blight.  Here is an example of one such blighted building in Detroit:

<img src="files/figs/blighted_building.png" alt="blighted building" style="width: 300px;" />
<small><i><a href="https://www.google.com/maps/@42.3359412,-83.0484953,3a,84.5y,212.04h,96.33t/data=!3m6!1e1!3m4!1sBpVLPQ_vUywLusKvOr6UWw!2e0!7i13312!8i6656!6m1!1e1">figure source</a></i></small>

## Blight in Detroit

This paper will focus on one of the cities most affected by urban blight, Detroit, MI.  Detroit has an ongoing [open data initiative](https://data.detroitmi.gov/) to help the city make policy decisions.  My goal is to create a model that will predict whether a building is blighted or not based on public data available from the city.  Hopefully, this model will help the city to understand and identify the causes of blight.

## Gathering Data
Now, we need to get data that will help us to identify whether a building is blighted or not.  First, it is helpful to identify what constitutes a "building".  For our purposes, a building will be defined as being represented by an address, a property identification (parcel) number, and a set of coordinate boundaries delineating the property.  This information can be found in the [parcel map shapefile](https://data.detroitmi.gov/Property-Parcels/Parcel-Map/fxkw-udwf) and the [parcel points ownership](https://data.detroitmi.gov/Property-Parcels/Parcel-Points-Ownership/eijm-6nr4) file.

Other data collected includes:
- [detroit-311.csv](https://data.detroitmi.gov/Government/Improve-Detroit-Submitted-Issues/fwz3-w3yn):  Improve Detroit issues ([311 requests](https://en.wikipedia.org/wiki/3-1-1))
- [detroit-blight-violations.csv](https://data.detroitmi.gov/Property-Parcels/Blight-Violations/teu6-anhh):  violations of [Detroit's blight code](https://www.municode.com/library/mi/detroit/codes/code_of_ordinances?nodeId=PTIIICICO_CH9BUBURE)
- [detroit-demolition-permits.csv](https://data.detroitmi.gov/Government/Detroit-Demolitions/rv44-e9di):  building demolitions performed by the Detroit government
- [detroit-fire.csv](https://data.detroitmi.gov/Public-Safety/2015-Fire-Data/g7tj-vvtd):  fire department calls

All data was collected for the year 2015, except for the demolition permits which were collected 2015-present.  The rationale for this is that blighted buildings may wait a while to be demolished due to budget concerns, etc.  Therefore, buildings demolished in 2016 likely were blighted in 2015.  Also, as we will see, the dataset is imbalanced with many more non-blighted than blighted buildings.  Collecting through present gives us more positive examples to work with.



# Data Wrangling and Feature Engineering
Recall our working description of blight:

- The property is owned by the city, typically through a lien after tax foreclosure.
- The property is in disrepair, potentially to a dangerous degree.
- The property is or may be demolished.

This section will focus on generating features that can describe these criteria.

## City Ownership
The parcel ownership shapefile contains a column "tax_status" that lists a property's tax status, e.g. taxable (private ownership), city owned, or county owned.  These tax statuses were cleaned and one-hot encoded.

## Disrepair
### Property value
The value of a property is a good proxy for its condition.  In the parcel ownership file, the column "SEV" (State Equalized Value, an assessed value that tracks estimated sale value) was used to represent property value.

### Location
Blight tends to vary geospatially from neighborhood to neighborhood, so the latitude and longitude of the building were used as features as well.  The map below shows a hexplot of blighted buildings by location, with arterial roads denoted by interior lines.  As the map shows, blight is far from evenly distributed:

<img src="files/figs/blighted_buildings_hex.png" alt="hexplot of blighted buildings"  style="height: 600px;"/>

### Blight Violations
Blight violations represent violations of Detroit's municipal code.  Some examples are:
- Sec. 9-1-111. Graffiti and defacement; duty to remove.
- Sec. 9-1-113. Minimum requirements for vacant buildings and structures

As the above might suggest, blight violations are highly informative as to a property's physical status.  These data were cleaned and one-hot encoded on violation type.  The violations were then summed over location.
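The encode-then-sum step can be sketched with pandas as follows.  This is a minimal illustration, not the project's actual processing code: the column names `parcel_no` and `violation_code` and the sample rows are hypothetical, and it assumes that "summed over location" means summing per parcel.

```python
import pandas as pd

# Hypothetical, simplified violation records (the real file has many more columns).
violations = pd.DataFrame({
    "parcel_no": ["01001", "01001", "01002", "01001"],
    "violation_code": ["9-1-111", "9-1-113", "9-1-111", "9-1-111"],
})

# One-hot encode the violation type: one 0/1 indicator column per code.
one_hot = pd.get_dummies(violations["violation_code"], prefix="viol", dtype=int)

# Sum the indicators per parcel to get a count of each violation type.
counts = one_hot.groupby(violations["parcel_no"]).sum()
print(counts)
```

The result has one row per parcel and one column per violation type, which is the shape needed to join onto the parcel list.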

### 311 Requests
311 requests are calls for municipal service for issues like running water in a building, illegal dumping of waste, traffic signal issues, and clogged street drains.  Some of these features are clearly more informative than others, but since the machine learning model we will use, a random forest, performs implicit feature selection and is quite robust to overfitting, all of the 311 request types were considered.  As before, the data was cleaned, one-hot encoded on issue type, and summed by number of occurrences.

### Fire Department Calls
As one might expect, fires are one of the leading causes of buildings becoming irreparably damaged.  If the owner cannot afford to replace the house, then these fire-damaged buildings are at a high risk of becoming blighted.  Also included in this file are other call types, like rubbish fires, that can be indicative of the property's condition.

## Demolitions:  Defining Positive Examples
The demolition file represents demolitions done by the local government of Detroit.  Since blight far outnumbers all other reasons for demolition in Detroit, the demolished buildings were assumed to be blighted.  Data was cleaned to remove incorrect parcel numbers.


# Training and Test Sets:  Putting It All Together

## Joins

After cleaning and processing, it was necessary to somehow aggregate these data with the filtered parcel ownership list.  The blight violations and demolitions were joined to the parcel file on parcel number.  The 311 issues and fire department calls required a bit more processing first.
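A join on parcel number might look like the sketch below.  The frames, column names (`parcel_no`, `sev`, `viol_9-1-111`), and values are hypothetical stand-ins for the cleaned files; the point is that a left join keeps every parcel, and that parcels with no violations get a count of zero.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned parcel, violation, and demolition data.
parcels = pd.DataFrame({"parcel_no": ["01001", "01002", "01003"],
                        "sev": [12000, 0, 45000]})
violation_counts = pd.DataFrame({"parcel_no": ["01001", "01002"],
                                 "viol_9-1-111": [2, 1]})
demolitions = pd.DataFrame({"parcel_no": ["01002"]})

# Left join keeps every parcel; parcels without violations get a count of 0.
features = parcels.merge(violation_counts, on="parcel_no", how="left").fillna(0)
features["viol_9-1-111"] = features["viol_9-1-111"].astype(int)

# Label a parcel as blighted (1) if it appears in the demolition list.
features["blighted"] = features["parcel_no"].isin(demolitions["parcel_no"]).astype(int)
print(features)
```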

311 requests are not necessarily specific to a building, so these data only contain latitude and longitude coordinates as a location identifier.  As such, the data was spatially joined using point-in-polygon tests on the coordinate data and building parcel shapefiles.  Computation was sped up using a spatial index called an R-tree.

### Spatial Joins with R-trees

[*R-trees*](https://en.wikipedia.org/wiki/R-tree) are a spatial index that lets you perform containment tests on spatial objects in *O*(log *n*) time on average, not including construction of the tree.  They consist of [minimum bounding rectangles](https://en.wikipedia.org/wiki/Minimum_bounding_rectangle) placed around the more complicated polygonal parcel shapes.  From these rectangles, the algorithm then recursively draws larger and larger minimum bounding rectangles around larger and larger groupings of objects.  These rectangles are axis-aligned with the coordinate system being used.  In our case, the x-dimension represents projected longitude and the y-dimension represents projected latitude.

Rectangles are used because testing whether a point lies inside an axis-aligned rectangle is computationally cheap, but this comes with one caveat:  for precise point-in-polygon tests, the original polygon still must be tested because the minimum bounding rectangle might cover area not covered by the polygon itself.  In this case, a point can be inside the rectangle, but not in the polygon, as shown below:

![Point inside a bounding rectangle but outside the polygon](files/figs/bounding_rectangle.png)

Or the point could be located within multiple bounding rectangles, in which case all of the intersecting rectangles' polygons must be iterated over, as shown below:

![Point inside two overlapping bounding rectangles](files/figs/two_rectangles.png)
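The two-stage test can be sketched in plain Python.  This is not the rtree library or the project's code: a linear scan over bounding rectangles stands in for the tree lookup, the parcel shapes are made up, and the exact test is a standard ray-casting check.  It only illustrates why a cheap rectangle hit must be confirmed against the real polygon.

```python
def bounding_box(polygon):
    """Axis-aligned minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y

def in_polygon(point, polygon):
    """Ray casting: count how many edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def locate(point, parcels):
    """Cheap rectangle filter first, exact polygon test on the survivors."""
    candidates = [pid for pid, poly in parcels.items()
                  if in_box(point, bounding_box(poly))]
    return [pid for pid in candidates if in_polygon(point, parcels[pid])]

# Hypothetical parcels: a triangle and a square whose bounding rectangles overlap.
parcels = {
    "A": [(0, 0), (4, 0), (0, 4)],
    "B": [(2, 2), (6, 2), (6, 6), (2, 6)],
}
print(locate((3, 3), parcels))  # inside both rectangles, but only one polygon
print(locate((1, 1), parcels))
```

The point (3, 3) falls inside both bounding rectangles, so both parcels are candidates, but only the exact test decides which parcel actually contains it.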

Unlike the 311 issues, the fire data did not come with nice coordinate data.  In fact, I tried all of the common projections for Michigan (StatePlane, GeoRef, etc.) and couldn't find one that played nice with the fire coordinates. As such, joins on the one-hot encoded incident type and the building list were done using the address.  *A lot* of cleaning and standardization had to be done on the addresses before they were usable, but they eventually got there. See the data processing file for exact details.

## Training/Test split
Since the positive (blighted) and negative (non-blighted) classes were *highly* imbalanced (0.02 to 0.98, respectively), [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling) was used to split the data.  The two classes were proportionately allocated into the training and test datasets using an 80/20 split.
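With scikit-learn, a stratified 80/20 split can be sketched as below.  The data here is synthetic (1,000 rows with 2% positives), and whether the original analysis used `train_test_split` or another splitter is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 positives and 980 negatives (2% / 98%).
y = np.array([1] * 20 + [0] * 980)
X = rng.normal(size=(1000, 3))

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random 20% sample of such rare positives could easily end up over- or under-represented in the test set.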

# Machine Learning: Random Forest

## Deciding on a Classification Algorithm
All told, the final dataset had ~385,000 unique data points and 279 distinct features.  Some were one-hot encodings of categorical features, some were sums of occurrences, and others, like sale value, latitude, and longitude, were continuous.  The respective ranges of these were (0 or 1), (0 to ~5, by discrete amounts), (0 to ~1e7), and (42.25 to 42.45 and -82.91 to -83.23).  Given this mix of variable types and ranges and the high-dimensional feature space, I decided to use a [Random Forest Classifier (RFC)](https://en.wikipedia.org/wiki/Random_forest) since the algorithm handles this type of heterogeneous data well and is quite robust to overfitting.  Random forests also have the useful property of being able to weight classes differently.  In this case, positive examples will receive a 50x weighting to account for the roughly 50:1 imbalance between negative and positive classes.
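In scikit-learn, class weighting is set through the `class_weight` parameter.  The sketch below uses synthetic data and a small forest; passing an explicit `{0: 1, 1: 50}` dictionary is one way to express the 50x weighting, and is an assumption about how it was done here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1000 rows, 2% positives (not the Detroit data).
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1
X[:20, 0] += 2.0  # give the positives a weak signal in the first feature

# Errors on the positive class count 50 times as much as errors on the negative class.
clf = RandomForestClassifier(
    n_estimators=25,
    class_weight={0: 1, 1: 50},
    random_state=0,
)
clf.fit(X, y)
print(int(clf.predict(X).sum()), "rows predicted positive")
```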

RFCs are an example of a bagging ([Bootstrap AGGregating](https://en.wikipedia.org/wiki/Bootstrap_aggregating)) algorithm, where each decision tree is fit on a sample drawn with replacement from the training data.  This practice increases the bias of the forest (with respect to the bias of a single non-random tree), but the subsequent averaging of individual tree predictions lowers the variance enough that overall model performance of the forest is generally superior to that of a single tree.

## Grid Search: Hyperparameter Tuning

Random forests give you several hyperparameters that can be adjusted to improve algorithm performance.  We will focus on three (names given are those of the scikit-learn implementation):
- *n_estimators*: the number of trees in the model. Generally, setting this value as high as computationally possible is optimal, provided other hyperparameters are adjusted to prevent overfitting.
- *min_samples_leaf*: the minimum number of examples required at a leaf node.  Increasing this value helps to reduce overfitting and increases robustness to outliers.
- *max_features*: the fraction (or number) of features randomly sampled as candidates at each split.  Smaller values decorrelate the trees, which slightly increases bias and reduces overfitting.

Generally, best practice is to first select an n_estimators as large as computationally possible.  Then, perform a [grid search](https://en.wikipedia.org/wiki/Hyperparameter_optimization#Grid_search) over likely values of the rest of the hyperparameters.

### F1-Score: Analyzing Model Performance on Unbalanced Classes

As stated before, this dataset is imbalanced, with only 2% of the data belonging to the minority blighted class.  A trivial classifier that predicted all buildings as non-blighted would have an accuracy of 98%, not too bad.  In practice, evaluating the model on accuracy would likely lead to a model that would only predict as blighted the buildings that it was highly confident about.  We could say that such a model would have [high precision and low recall](https://en.wikipedia.org/wiki/Precision_and_recall).  Using [AUROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) as a metric would not work well either: the ROC curve plots the true positive rate against the false positive rate, and with so many negatives the false positive rate stays small even when false positives outnumber true positives, so AUROC can look good while precision is poor.  One way of getting around this issue is to use [F1-score](https://en.wikipedia.org/wiki/F1_score), the harmonic mean of precision and recall.  As the harmonic mean, F1-score requires both precision and recall to be high to have good performance.
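For reference, with TP, FP, and FN denoting true positives, false positives, and false negatives, the standard definitions are:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

For the trivial all-negative classifier, TP = 0, so recall is 0 and the F1-score is 0 (taking the undefined precision as 0 by convention), even though its accuracy is 98%.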

### Cross-Validation
During the grid search, mean F1-score on the <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation">5-fold stratified cross-validation</a> of the training set was used as the evaluation metric.  Optimal parameters were found to be: *max_features*: 0.5, *min_samples_leaf*: 3, and *n_estimators*: 150.
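A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`.  The data below is synthetic, the forest is kept small so the example runs quickly, and the candidate grid values are illustrative rather than the ones actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in data with a minority positive class.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.0).astype(int)

# Illustrative candidate values; the real grid is not given in the text.
param_grid = {"min_samples_leaf": [1, 3], "max_features": [0.5, 1.0]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, class_weight={0: 1, 1: 50}, random_state=0),
    param_grid,
    scoring="f1",  # mean F1 across folds is the selection metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated F1 of the best parameter combination, which corresponds to the "Mean F1 Score (5-fold CV)" figure reported in the results.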

# Results and Discussion

The model's performance was as follows:

- ***Accuracy*** (train) : 0.9826
- ***F1 Score*** (train): 0.678998
- ***Mean F1 Score*** (5-fold CV): 0.341827

Performance on the test set was similar to the cross-validation score:
- ***Accuracy*** (test): 0.9663
- ***F1 Score*** (test): 0.346445

So why is the F1 score so low?  Let's take a quick look at the confusion matrices for the training and test data:

<center><strong>Training</strong></center>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Predicted Negative</th>
      <th>Predicted Positive</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Actual Negative</th>
      <td>296480</td>
      <td>5310</td>
    </tr>
    <tr>
      <th>Actual Positive</th>
      <td>35</td>
      <td>5653</td>
    </tr>
  </tbody>
</table>


<center><strong>Test</strong></center>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Predicted Negative</th>
      <th>Predicted Positive</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Actual Negative</th>
      <td>73590</td>
      <td>1857</td>
    </tr>
    <tr>
      <th>Actual Positive</th>
      <td>735</td>
      <td>687</td>
    </tr>
  </tbody>
</table>


Two things immediately jump out:
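- First, recall is much better on the training set (5653 of 5688 positives, about 0.99) than on the test set (687 of 1422, about 0.48).  This suggests that the model may be overfitting the positive training examples, or that some aspect of the positive examples in the test set was not captured in the training set.
- Second, precision is lacking in the model's performance on both sets of data (about 0.52 on training and 0.27 on test).  It is predicting a lot of negative examples as positive.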
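These observations can be checked directly from the two confusion matrices above.  The short sketch below recomputes accuracy, precision, recall, and F1 from the four cell counts; the helper name `scores` is just for illustration.

```python
def scores(tn, fp, fn, tp):
    """Compute standard metrics from the four cells of a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Cell counts copied from the training and test confusion matrices.
train = scores(tn=296480, fp=5310, fn=35, tp=5653)
test = scores(tn=73590, fp=1857, fn=735, tp=687)

for name, result in (("train", train), ("test", test)):
    print(name, {metric: round(value, 4) for metric, value in result.items()})
```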

Let's investigate the first issue by looking at which data points our algorithm is incorrectly classifying:

<img src="files/figs/recall_test.png" alt="map of false negatives in the test set" style="height: 600px;"/>

From the map we can see that a significant number of the false negatives (actual blighted buildings predicted as non-blighted) are from areas where the training set didn't have much data.  This implies that collecting more geographically diverse examples of blighted buildings might improve performance, if possible.

As to the second issue: we can get an idea of which features our model is weighting using an attribute of the fitted model called the [feature importance](https://en.wikipedia.org/wiki/Random_forest#Variable_importance).

<img src="files/figs/feature_importance.png" alt="feature_importance" style="height: 600px;"/>

From the feature importance graph we can see that the only features getting a lot of weight are the tax status of the property, the coordinate location of the property, and the estimated sale value.  These features are present on every data point, so it makes sense that they would be more important to the algorithm, but still one would think that the blight violation features would be more predictive.
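In scikit-learn, these values are exposed on a fitted forest as `feature_importances_`.  A minimal sketch on synthetic data, with made-up feature names, shows how a ranking like the one in the graph can be produced:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data: only the first column carries any signal about the label.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
names = ["informative", "noise_a", "noise_b"]  # hypothetical feature names

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across features.
ranked = sorted(zip(names, clf.feature_importances_), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Note that impurity-based importances tend to favor continuous, high-cardinality features, which may partly explain why coordinates and property value rank so highly compared with sparse one-hot counts.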

Further inspection reveals that the problem stems from our "ground truth" label of blight.  Lacking a better proxy for blightedness, we used city of Detroit demolition data to classify a building as blighted or not.  This leads to buildings that are actually blighted, but haven't been demolished yet, being erroneously labeled as non-blighted in our data.  In fact, after much digging, I couldn't find one building that was incorrectly identified as blighted by the algorithm that proved to be non-blighted upon visual inspection via Google Maps.  For example, all of the following buildings were "false positives" that were ground-truth labeled as non-blighted using the demolitions data to define blightedness:

<img src="files/figs/487_algonquin.png" alt="" style="height: 400px;"/>
<img src="files/figs/2961_basset.png" alt="" style="height: 400px;"/>
<img src="files/figs/5083_buckingham.png" alt="" style="height: 400px;"/>
<img src="files/figs/8213_fielding.png" alt="" style="height: 400px;"/>

Algorithms are only as good as the data you feed them.  For our model to have good performance, we would need a better indicator as to a building's blighted status than city demolitions data.  The best way to do this would be in-person site inspections to generate a training set.  Barring that, I would bet you could get a good indication of whether a building has been abandoned or not by analyzing its utilities data.  Since I don't have access to data for that, this is where the analysis ends for now. 