## Topic - Predicting Property Maintenance Fines using ML models  

___

This is the final project of the Coursera course "Applied Machine Learning in Python", offered by the University of Michigan. Some modifications have been made to improve the presentation and make the process more complete.

This project is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)). 

The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit - blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this project, the task is to predict whether a given blight ticket will be paid on time.

___


**File descriptions**

    train.csv - the training set (all tickets issued 2004-2011)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates.  
    (Note: 'test.csv' is not included here because it has no ground-truth target values, so predictions on it cannot be evaluated. Instead, 'train.csv' is split into a training set and a test set, which are used in all later steps.)
    

___

**Evaluation and Model Selection**

Each model predicts whether a given blight ticket will be paid on time.  
  
Accuracy, precision, recall, and AUC are reported for every model so that the models can be compared.
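As a reminder of what these metrics measure, the sketch below derives accuracy, precision, and recall from confusion-matrix counts. The labels are made-up toy values (an assumption for illustration, not taken from the blight data), with 1 meaning "paid on time".

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels (hypothetical): 1 = paid on time, 0 = not paid on time
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For labels ordered 0, 1 the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all tickets classified correctly
precision = tp / (tp + fp)                  # of tickets predicted "paid", share actually paid
recall = tp / (tp + fn)                     # of tickets actually paid, share predicted "paid"

print('Accuracy: {:.2f}'.format(accuracy))
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
```

The hand-computed values agree with `accuracy_score`, `precision_score`, and `recall_score` from scikit-learn, which are the functions used in the cells that follow.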

In [5]:
def split_train_test():
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    
    data = pd.read_csv('readonly/train.csv',encoding='ISO-8859-1',low_memory=False)
    # only include observations with target values 0 or 1
    data = data[(data['compliance']==0) | (data['compliance']==1)]
    address =  pd.read_csv('readonly/addresses.csv',encoding='ISO-8859-1',low_memory=False)
    latlons = pd.read_csv('readonly/latlons.csv',encoding='ISO-8859-1',low_memory=False)
    address_latlons = address.set_index('address').join(latlons.set_index('address'),how='left')
    # combine to get the whole dataset
    data_new = data.set_index('ticket_id').join(address_latlons.set_index('ticket_id'),how='left').reset_index()

    # drop columns that are only known after the outcome (payment details) to avoid data leakage
    train_columns_to_drop = ['payment_amount','payment_date','payment_status',\
                         'balance_due','collection_status','compliance_detail']
    data_new.drop(train_columns_to_drop,axis=1,inplace=True)
    # delete string variables
    string_columns_to_drop = ['agency_name','inspector_name','violator_name',\
                     'violation_street_number','violation_street_name','violation_zip_code',\
                     'mailing_address_str_number', 'mailing_address_str_name', \
                     'city', 'state', 'zip_code', 'non_us_str_code', 'country',\
                     'ticket_issued_date','hearing_date','violation_code','violation_description',\
                     'disposition','grafitti_status']
    data_new.drop(string_columns_to_drop,axis=1,inplace=True)

    # count missing values per column (informational; the result is not used)
    data_new.isnull().sum()
    # forward-fill missing coordinates from the previous row
    data_new['lat'] = data_new['lat'].ffill()
    data_new['lon'] = data_new['lon'].ffill()

    # split_train_test
    X = data_new.drop('compliance',axis=1)
    y = data_new['compliance']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    
    return X_train, X_test, y_train, y_test

In [6]:
# Baseline: Dummy Classifier
from sklearn.dummy import DummyClassifier
X_train, X_test, y_train, y_test = split_train_test()
# Negative class (0) is most frequent
dummy_majority = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_majority.score(X_test,y_test)

0.9276457343007255
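The baseline score is simply the share of the majority class in the test set. A minimal sketch on synthetic labels (the 7% positive rate and the variable names are assumptions for illustration, not the blight data) shows this, and also shows that such a classifier never identifies a paid ticket:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced labels: roughly 7% positives (hypothetical rate)
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.07).astype(int)
X = np.zeros((1000, 1))  # features are ignored by the dummy classifier

dummy = DummyClassifier(strategy='most_frequent').fit(X, y)

baseline_accuracy = dummy.score(X, y)
majority_share = max(y.mean(), 1 - y.mean())
baseline_recall = recall_score(y, dummy.predict(X))

print('Baseline accuracy: {:.4f}'.format(baseline_accuracy))
print('Majority share:    {:.4f}'.format(majority_share))
print('Baseline recall:   {:.2f}'.format(baseline_recall))
```

This is why accuracy alone says little on this kind of data: a model has to beat the majority share before its accuracy carries any information.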

In [7]:
# Logistic Regression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, auc
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = split_train_test()
logit = LogisticRegression(C=100).fit(X_train, y_train)
logit_predicted = logit.predict(X_test)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, logit_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, logit_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, logit_predicted)))
fpr_logit, tpr_logit, _ = roc_curve(y_test, logit_predicted)
roc_auc_logit = auc(fpr_logit, tpr_logit)
print('AUC: {:.2f}'.format(roc_auc_logit))

Accuracy: 0.93
Precision: 0.76
Recall: 0.06
AUC: 0.53
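Note that the AUC above is computed from hard 0/1 predictions (`logit.predict`), so the ROC curve has at most one interior point. AUC is more commonly computed from continuous scores such as `predict_proba`. The sketch below contrasts the two on synthetic data; the dataset, its class weights, and the variable names are assumptions for illustration, and the numbers it prints say nothing about the blight data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the ticket data (assumption)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.93, 0.07], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUC from hard labels: only one operating point of the classifier is used
fpr_hard, tpr_hard, _ = roc_curve(y_test, model.predict(X_test))
auc_from_labels = auc(fpr_hard, tpr_hard)

# AUC from positive-class probabilities: uses the full ranking of test samples
proba = model.predict_proba(X_test)[:, 1]
auc_from_scores = roc_auc_score(y_test, proba)

print('AUC from labels: {:.2f}'.format(auc_from_labels))
print('AUC from scores: {:.2f}'.format(auc_from_scores))
```

The same change (passing `predict_proba(X_test)[:, 1]` instead of `predict(X_test)` to `roc_curve`) would apply to the kNN and gradient boosting cells as well.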


In [9]:
# kNN Classification
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, auc
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = split_train_test()
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_predicted = knn.predict(X_test)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, knn_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, knn_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, knn_predicted)))
fpr_knn, tpr_knn, _ = roc_curve(y_test, knn_predicted)
roc_auc_knn = auc(fpr_knn, tpr_knn)
print('AUC: {:.2f}'.format(roc_auc_knn))

Accuracy: 0.93
Precision: 0.43
Recall: 0.08
AUC: 0.53


In [10]:
# Gradient Boosted Decision Trees
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, auc
from sklearn.ensemble import GradientBoostingClassifier
X_train, X_test, y_train, y_test = split_train_test()
tree = GradientBoostingClassifier(learning_rate=0.01,max_depth=2,random_state=0).fit(X_train, y_train)
tree_predicted = tree.predict(X_test)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, tree_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, tree_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, tree_predicted)))
fpr_tree, tpr_tree, _ = roc_curve(y_test, tree_predicted)
roc_auc_tree = auc(fpr_tree, tpr_tree)
print('AUC: {:.2f}'.format(roc_auc_tree))

Accuracy: 0.93
Precision: 0.95
Recall: 0.09
AUC: 0.54


### Conclusions

___

a) The overall accuracy and AUC scores of the three models are similar;  
b) The baseline accuracy is 0.9276, which is very close to that of the three trained models. This indicates that **informative features might be missing, as training a model barely improves accuracy**;   
c) Precision and recall vary across the three models. Usually, there is a tradeoff between precision and recall. **In this case, we would like to reduce the number of false positives (a false positive means a ticket is classified as compliant when the resident does not actually pay on time, so the fine goes uncollected, which hurts the finances of the parties involved).** A model with a **higher precision score** is therefore preferable. On that basis, the **Gradient Boosted Decision Trees with learning_rate 0.01 and max_depth 2** is the best of the three models for predicting blight ticket compliance.
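
The precision and recall tradeoff in c) can also be tuned directly by moving the decision threshold instead of relying on the default of 0.5. The following is a minimal sketch on synthetic data: the dataset, the default `GradientBoostingClassifier` settings, and the 0.90 target precision are all assumptions for illustration, not the configuration evaluated above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve, precision_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the ticket data (assumption)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.93, 0.07], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# precision[i] and recall[i] describe predictions with score >= thresholds[i];
# the final precision/recall pair has no matching threshold
precision, recall, thresholds = precision_recall_curve(y_test, scores)

target_precision = 0.90  # hypothetical requirement
meets_target = np.where(precision[:-1] >= target_precision)[0]

if meets_target.size > 0:
    # thresholds are increasing, so the first match keeps recall as high as possible
    best = meets_target[0]
    chosen_threshold = thresholds[best]
    print('Threshold: {:.3f}'.format(chosen_threshold))
    print('Precision: {:.2f}'.format(precision[best]))
    print('Recall: {:.2f}'.format(recall[best]))
else:
    chosen_threshold = None
    print('No threshold reaches the target precision')
```

In practice the threshold would be chosen on a validation split rather than on the test set, so that the reported test metrics stay unbiased.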