# Modeling city blight with Machine Learning

## By Jiancheng ***(jekyLL4168@icloud.com)***

### The full solution can be found on my [GitHub repo](https://github.com/duducheng/Kaggle/tree/master/Coursera-Blight), together with [this report](https://github.com/duducheng/Kaggle/blob/master/Coursera-Blight/Report.ipynb)

In [1]:
__author__ = "Jiancheng"

# Bonjour!

This project is the Capstone Project of the [Coursera Data Science at Scale Specialization](https://www.coursera.org/specializations/data-science).

Project background can be found on the [GitHub repo](https://github.com/uwescience/datasci_course_materials/blob/master/capstone/blight/blightfight.md) of this course.

Basically, this project aims to predict which "buildings" (or particular places) will be ***"abandoned"*** in the future, using features such as crime rates, city violation records, and 311 call records. All the data is publicly available from the [Detroit City Gov](http://data.detroitmi.gov) site.

# Result

The Accuracy on the balanced dataset: **~75.84%**

The AUC on the balanced dataset: **~78.54%**

# Difficulties

This project is very practical. Compared with most data mining competitions, such as those on Kaggle, it takes much more time to define the **purpose**, which makes the problem much more open-ended.

In my experience with this project, the most important decision was how to define a "building" from the data, because that definition strongly influences the results. The data is dirty: even the geographic locations, given as (latitude, longitude) pairs, are inaccurate, not only because of noise but also because some events (some crimes, for example) have no exact location in reality. In my opinion, this is the hardest part of the project.

1. At first, I used grids to define locations, but this worked badly: some grid cells contain too many buildings, which adds a lot of noise.

2. Then I tried a clustering method, k-means, but it has the same problem as the grid approach, so it also worked poorly.

3. Finally, I settled on a "naive" approach: the address itself. It seems naive, but it may be the most appropriate method in this particular case.

It was a great lesson for me in **how hard it is to define the exact problem**. In a real project, it may be even more difficult to find the so-called right question.
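To make the grid problem from item 1 concrete, here is a minimal sketch. The addresses, coordinates, and `cell_size` value are made up for illustration and are not taken from the project's data. Distinct addresses that sit close together fall into the same cell, so every incident in that cell would be attributed to all of them.

```python
from collections import defaultdict

def grid_cell(lat, lng, cell_size):
    """Map a normalized (lat, lng) pair to an integer grid cell."""
    return (int(lat // cell_size), int(lng // cell_size))

# Hypothetical buildings: (address, lat, lng), coordinates scaled to [0, 1].
buildings = [
    ("100 EXAMPLE ST", 0.7148, 0.9031),
    ("102 EXAMPLE ST", 0.7149, 0.9033),
    ("104 EXAMPLE ST", 0.7151, 0.9036),
    ("900 OTHER AVE", 0.2201, 0.4510),
]

cell_size = 0.01  # assumed grid resolution
cells = defaultdict(list)
for addr, lat, lng in buildings:
    cells[grid_cell(lat, lng, cell_size)].append(addr)

# Cells holding more than one address cannot tell those buildings apart.
for cell, addrs in sorted(cells.items()):
    print(cell, addrs)
```

Shrinking `cell_size` separates these addresses, but it also leaves most cells empty, which is one reason using the address directly can be simpler.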

# Methodology

## 1. Clean the data

This step is extremely important in this project, because the raw dataset is really dirty.

The cleaning was done in several steps:
* Remove "strange" records, such as those carrying the default value for the city of Detroit. Some of these were fixed using their address.
* Remove points that seem far from Detroit.
* Normalize lat and lng with min-max scaling.
* Clean the addresses in the building list.
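Two of the steps above (dropping far-away points and min-max scaling) can be sketched as follows. This is only an illustration: the `raw` records and the bounding box values are assumptions, and the real cleaning code lives in the project's own notebooks.

```python
import pandas as pd

# Made-up raw records; the last row is far from Detroit.
raw = pd.DataFrame({
    "lat": [42.33, 42.40, 42.36, 40.71],
    "lng": [-83.05, -83.10, -83.00, -74.00],
})

# Assumed rough bounding box around Detroit (illustrative values only).
LAT_MIN, LAT_MAX = 42.25, 42.46
LNG_MIN, LNG_MAX = -83.30, -82.90

in_box = raw["lat"].between(LAT_MIN, LAT_MAX) & raw["lng"].between(LNG_MIN, LNG_MAX)
clean = raw[in_box].copy()

# Min-max scaling: (x - min) / (max - min), so each column spans [0, 1].
for col in ["lat", "lng"]:
    lo, hi = clean[col].min(), clean[col].max()
    clean[col] = (clean[col] - lo) / (hi - lo)

print(clean)
```

When several tables are later matched by location, the same minimum and maximum would need to be reused for every table so that the scaled coordinates stay comparable.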

Things to improve: how the addresses are cleaned.

Our "building" list is built from the addresses, so address cleaning can be really important but is also tedious, which is why I stopped here.

The clean dataset is in the "clean/" folder.

The code is spread over several notebooks in the "prepare" folder.

In [2]:
import pandas as pd

In [3]:
clean_permit = pd.read_pickle('clean/permit.pickle')
clean_violation = pd.read_pickle('clean/violation.pickle') 
clean_crime = pd.read_pickle('clean/crime.pickle')
clean_311 = pd.read_pickle('clean/311.pickle')
clusters = pd.read_csv('clean/buildings.csv')[['lat','lng']]

In [4]:
# only use the data from 2005 onward
clean_permit, clean_violation, clean_crime, clean_311 = map(lambda df: df[df.date.map(lambda x: x.year)>=2005],
                                                            [clean_permit, clean_violation, clean_crime, clean_311])

In [5]:
clean_permit.head(3)

Unnamed: 0,lat,lng,date,addr
0,0.714869,0.903169,2015-08-28,4331 BARHAM
1,0.720103,0.750076,2015-08-28,9707 BESSEMORE
2,0.761821,0.905622,2015-08-28,5315 BERKSHIRE


In [6]:
clean_violation.head(3)

Unnamed: 0,lat,lng,addr,date,JudgeAmt
38854,0.868804,0.852497,15051 YOUNG,2006-03-18,140.0
38855,0.727499,0.29446,14615 SNOWDEN,2006-03-24,250.0
38856,0.725672,0.436063,2561 FORT,2006-01-03,305.0


In [7]:
clean_crime.head(3)

Unnamed: 0,lat,lng,Category,date
0,0.579349,0.439459,0,2015-06-03
1,0.603046,0.223398,1,2015-03-01
2,0.91936,0.699583,2,2015-02-08


In [8]:
clean_311.head(3)

Unnamed: 0,lat,lng,Category,date
0,0.662794,0.336104,0,2015-04-12 01:01:10
1,0.953726,0.548767,0,2015-04-07 14:04:44
2,0.978316,0.864315,0,2015-08-15 00:03:44


In [9]:
clusters.head(3)

Unnamed: 0,lat,lng
0,0.714869,0.903169
1,0.720103,0.750076
2,0.761821,0.905622


## 2. Build the training dataset

Instead of using a range to define a building, I used nearest-neighbor search to decide which building each incident belongs to.

This idea simplifies the data-building step and also has a useful property:

Many incidents actually influence several buildings, and their effect should decay with geographic distance.

Such a decay could easily be added with the nearest-neighbor approach. I left it out because it introduces more hyperparameters for the decay field, which makes parameter tuning harder.
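The decay idea was not implemented in this project, but a minimal sketch could look like the following. Each incident is spread over its `k` nearest buildings with an exponential weight. The coordinates, `k`, `scale`, and the exponential form are all assumptions chosen for illustration; they are exactly the kind of extra hyperparameters mentioned above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up normalized coordinates for 4 buildings and 3 incidents.
buildings = np.array([[0.10, 0.10], [0.12, 0.10], [0.50, 0.50], [0.90, 0.90]])
incidents = np.array([[0.11, 0.10], [0.49, 0.51], [0.10, 0.11]])

k = 2         # assumed number of buildings each incident influences
scale = 0.05  # assumed decay length, in normalized coordinate units

nbrs = NearestNeighbors(n_neighbors=k).fit(buildings)
distances, indices = nbrs.kneighbors(incidents)  # both have shape (n_incidents, k)

# Exponential decay: closer buildings receive a larger share of each incident.
weights = np.exp(-distances / scale)

# Accumulate weighted counts per building; np.add.at handles repeated indices.
decayed_count = np.zeros(len(buildings))
np.add.at(decayed_count, indices.ravel(), weights.ravel())

print(decayed_count)
```

With `k=1` and all weights set to 1, this reduces to the plain nearest-neighbor count used in the rest of the notebook.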

In [10]:
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler as Scaler

In [11]:
nbrs = NearestNeighbors(n_neighbors=1).fit(clusters)
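def find_nearest(df):
    # Work on a copy so the filtered slice from In [4] is not modified in place.
    df = df.copy()
    distances, indices = nbrs.kneighbors(df[['lat','lng']])
    # indices has shape (n_rows, 1); flatten it to one building index per row.
    df['neighbor'] = indices.ravel()
    return df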

clean_permit, clean_violation, clean_crime, clean_311 = map(find_nearest,[clean_permit, clean_violation, clean_crime, clean_311])

### add the features

In [12]:
# target: number of permits assigned to each building (filled with 0 if none)
clusters['permit'] = clean_permit.groupby('neighbor').apply(lambda g: g['addr'].count())
clusters = clusters.fillna(0)

In [13]:
# violation judge amt and count
clusters['judge'] = clean_violation.groupby('neighbor').apply(lambda g: g['JudgeAmt'].mean())
clusters['violation'] = clean_violation.groupby('neighbor').apply(lambda g: g['JudgeAmt'].count())
clusters = clusters.fillna(0)

In [14]:
# crime count with category
clusters = pd.concat([clusters,pd.get_dummies(clean_crime['Category'],prefix='crime').groupby(clean_crime['neighbor']).sum()],axis=1)
clusters = clusters.fillna(0)

In [15]:
# 311 calls count with category
clusters = pd.concat([clusters,pd.get_dummies(clean_311['Category'],prefix='311').groupby(clean_311['neighbor']).sum()],axis=1)
clusters = clusters.fillna(0)

In [16]:
clusters.head(3)

Unnamed: 0,lat,lng,permit,judge,violation,crime_0,crime_1,crime_2,crime_3,crime_4,...,311_13,311_14,311_15,311_16,311_17,311_18,311_19,311_20,311_21,311_22
0,0.714869,0.903169,1.0,360.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.720103,0.750076,1.0,346.25,8.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.761821,0.905622,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### balance the positive and negative samples

In [17]:
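def balanced_dtrain(raw, clf=True, sample=True, target='permit'):
    # Keep rows whose values do not sum to zero (the sum covers every column of raw).
    non_zero = raw[raw.sum(axis=1)!=0].copy()
    if clf:
        # Classification: turn the permit count into a boolean label.
        non_zero[target] = non_zero[target]>0
    positive = non_zero[non_zero[target]>0]
    negative = non_zero[non_zero[target]==0]
    if sample:
        # Downsample negatives to the number of positives (random, so scores vary between runs).
        negative = negative.sample(len(positive))
    return pd.concat([negative,positive]).sort_index()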

In [18]:
dataset = balanced_dtrain(clusters).drop(['lat','lng'],axis=1)
X = dataset.drop('permit',axis=1)
y = dataset['permit']

In [19]:
X.head(3)

Unnamed: 0,judge,violation,crime_0,crime_1,crime_2,crime_3,crime_4,crime_5,crime_6,crime_7,...,311_13,311_14,311_15,311_16,311_17,311_18,311_19,311_20,311_21,311_22
0,360.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,346.25,8.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
y.head(3)

0    True
1    True
2    True
Name: permit, dtype: bool

## 3. Use the best model found by cross-validation search

In [21]:
clf = GradientBoostingClassifier(n_estimators=100,max_depth=1)

In [22]:
print("Accuracy: {}".format(cross_val_score(clf, X, y, cv=5).mean()))

Accuracy: 0.758794910946


In [23]:
print("AUC: {}".format(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()))

AUC: 0.778558885367
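The heading for this step mentions a cross-validation search, but only the chosen settings appear in the cells above. As a hedged sketch of what such a search could look like, the example below uses scikit-learn's `GridSearchCV` on synthetic data. The `param_grid` values and the `X_demo`, `y_demo` names are assumptions; the grid actually used in the project is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the balanced (X, y) built earlier in the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Assumed search grid, for illustration only.
param_grid = {"n_estimators": [50, 100], "max_depth": [1, 2]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation, as in the cells above
    scoring="roc_auc",   # pick the combination with the best mean AUC
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Replacing `X_demo` and `y_demo` with the notebook's `X` and `y` would run the same search on the real features.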


# Conclusion

So far, we have a working model that does not take time into account.

In other words, with data up to October 2015, the accuracy of our model is about 76%.

# Ideas that haven't been tested but may work

The model could be improved to some degree by:

* Cleaning the addresses better
* Taking time into account to get more training data
* Modeling the spatial decay of incidents
* Using a convnet to capture spatial relationships
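As an untested sketch of the time idea in the list above, features can be built from incidents before a cutoff date and the label from permits issued on or after it. Repeating this for several cutoffs yields several rows per building. The records, the cutoff dates, and the `snapshot` helper are hypothetical; only the `neighbor` and `date` column names follow the notebook.

```python
import pandas as pd

# Made-up incident and permit records keyed by the notebook's 'neighbor' index.
violations = pd.DataFrame({
    "neighbor": [0, 0, 1, 2, 2],
    "date": pd.to_datetime(["2012-03-01", "2014-06-10", "2013-01-15",
                            "2011-09-30", "2015-02-01"]),
})
permits = pd.DataFrame({
    "neighbor": [0, 2],
    "date": pd.to_datetime(["2015-05-01", "2012-07-04"]),
})

def snapshot(cutoff, n_buildings=3):
    """Features from incidents before `cutoff`, label from permits on or after it."""
    cutoff = pd.Timestamp(cutoff)
    index = pd.RangeIndex(n_buildings, name="neighbor")
    past = violations[violations["date"] < cutoff]
    future = permits[permits["date"] >= cutoff]
    out = pd.DataFrame(index=index)
    out["violation"] = past.groupby("neighbor").size().reindex(index, fill_value=0)
    out["permit"] = out.index.isin(future["neighbor"])
    out["cutoff"] = cutoff
    return out.reset_index()

# Each cutoff gives one row per building, so several cutoffs multiply the training data.
train = pd.concat([snapshot("2013-01-01"), snapshot("2015-01-01")], ignore_index=True)
print(train)
```

A fuller version would probably also drop buildings whose permit predates the cutoff, since their outcome is already known at that point.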