---

_You are currently looking at **version 1.1** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-machine-learning/resources/bANLa) course resource._

---

## Understanding and Predicting Property Maintenance Fines


The Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)) and the Michigan Student Symposium for Interdisciplinary Statistical Sciences ([MSSISS](https://sites.lsa.umich.edu/mssiss/)) have partnered with the City of Detroit to help solve one of the most pressing problems facing Detroit: blight. [Blight violations](http://www.detroitmi.gov/How-Do-I/Report/Blight-Complaint-FAQs) are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents, and every year many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

All data for this project is provided through the [Detroit Open Data Portal](https://data.detroitmi.gov/). 

<br>

**File descriptions** (Use only this data for training your model!)

    train.csv - the training set (all tickets issued 2004-2011)
    test.csv - the test set (all tickets issued 2012-2016)
    addresses.csv & latlons.csv - mapping from ticket id to addresses, and from addresses to lat/lon coordinates. 
     Note: misspelled addresses may be incorrectly geolocated.
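
The two lookup files chain together: `addresses.csv` maps a ticket to an address string, and `latlons.csv` maps that address to coordinates. A minimal sketch of the join on toy data follows. The column names `ticket_id`, `address`, `lat`, and `lon` are the ones used by the model code later in this notebook; the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for addresses.csv and latlons.csv (all values are invented).
addresses = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "address": ["100 main st, Detroit MI", "200 oak ave, Detroit MI", "300 elm rd, Detroit MI"],
})
latlons = pd.DataFrame({
    "address": ["100 main st, Detroit MI", "200 oak ave, Detroit MI"],
    "lat": [42.33, 42.38],
    "lon": [-83.05, -83.10],
})

# A left join keeps every ticket, even when its address has no coordinates.
ticket_latlons = addresses.merge(latlons, on="address", how="left")[["ticket_id", "lat", "lon"]]
print(ticket_latlons)
```

With a left join, a ticket whose address failed to geolocate keeps its row and gets missing `lat`/`lon` values, which can then be imputed explicitly rather than silently dropped.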

<br>

**Data fields**

train.csv & test.csv

    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
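
As a rough check on how the fee fields combine, the sketch below adds them up for a single responsible judgment. It assumes the 10% late fee is computed on `fine_amount`, and it leaves out `discount_amount` and `clean_up_cost`; the field list does not spell out either point, so treat the formula as an assumption to verify against the data. `judgment_total` is a hypothetical helper name.

```python
def judgment_total(fine_amount):
    """Hypothetical helper: fine plus the three fees for a responsible judgment."""
    admin_fee = 20.0               # flat fee, per the field list
    state_fee = 10.0               # flat fee, per the field list
    late_fee = 0.10 * fine_amount  # assumption: 10% of the fine itself
    return fine_amount + admin_fee + state_fee + late_fee

total = judgment_total(250.0)
print(total)
```

Comparing this figure with `judgment_amount` for a few rows is a quick way to confirm or reject the assumed formula.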
    
train.csv only

    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction] 
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant
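
Since a null `compliance` means the violator was found not responsible, those rows carry no label to learn from. One way to prepare the target, sketched here on invented rows, is to drop them before splitting features and labels:

```python
import numpy as np
import pandas as pd

# Invented rows; only the compliance coding (NaN / 0 / 1) follows the field list.
train = pd.DataFrame({
    "ticket_id": [10, 11, 12, 13],
    "fine_amount": [250.0, 50.0, 100.0, 250.0],
    "compliance": [0.0, np.nan, 1.0, 0.0],
})

# Keep only tickets where the violator was found responsible (label present).
labeled = train[train["compliance"].notna()]

y = labeled["compliance"].astype(int)
X = labeled.drop(columns=["compliance"])
print(y.value_counts().to_dict())
```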


___


In [25]:
import timeit
import pandas as pd
import numpy as np

def dftreat(df_treated, agency_name_mapping, address_latlons,disposition_level1_mapping,disposition_level2_mapping):
    # drop fields that do not contribute to the prediction 
    df_treated.drop( ['violation_street_number','violation_street_name','violation_zip_code','mailing_address_str_number',
                    'mailing_address_str_name','non_us_str_code', 'inspector_name', 'violator_name',
                   'city','state','country'] , axis = 1, inplace = True )
    
    #  Agency name encoding 
    df_treated.replace(agency_name_mapping, inplace = True )
    
    #  disposition_level1 encoding (coarse: not responsible / responsible / pending)
    df_treated['disposition_level1'] =  df_treated['disposition']
    df_treated.replace(disposition_level1_mapping, inplace = True )
    
    #  disposition_level2 encoding (finer-grained judgment type)
    df_treated['disposition_level2'] =  df_treated['disposition']
    df_treated.replace(disposition_level2_mapping, inplace = True )
    
    df_treated.drop('disposition', axis = 1, inplace = True)
    
    # Address Latitude and Longitude 
    df_treated = pd.merge( df_treated, address_latlons, on = 'ticket_id')
    
    # Extracting Date time information from date fields. 
    df_treated['ticket_issued_date'] = pd.to_datetime(df_treated['ticket_issued_date'])
    df_treated['ticket_issued_year'] = df_treated['ticket_issued_date'].dt.year
    df_treated['ticket_issued_month'] = df_treated['ticket_issued_date'].dt.month
    df_treated['ticket_issued_dayofmonth'] = df_treated['ticket_issued_date'].dt.day
    df_treated['ticket_issued_dayofweek'] = df_treated['ticket_issued_date'].dt.weekday
    df_treated['ticket_issued_dayofyear'] = df_treated['ticket_issued_date'].dt.dayofyear
    df_treated['ticket_issued_hourofday'] = df_treated['ticket_issued_date'].dt.hour
    
    # Same calendar features for the hearing date; missing dates yield NaN here
    df_treated['hearing_date'] = pd.to_datetime(df_treated['hearing_date'])
    df_treated['hearing_date_year'] = df_treated['hearing_date'].dt.year
    df_treated['hearing_date_month'] = df_treated['hearing_date'].dt.month
    df_treated['hearing_date_dayofmonth'] = df_treated['hearing_date'].dt.day
    df_treated['hearing_date_dayofweek'] = df_treated['hearing_date'].dt.weekday
    df_treated['hearing_date_dayofyear'] = df_treated['hearing_date'].dt.dayofyear
    df_treated['hearing_date_hourofday'] = df_treated['hearing_date'].dt.hour
    
    return df_treated
    


def blight_model():
  
    # Widen pandas display limits for easier inspection
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    
    ## Creating Addresses Data Frame with Lat & Longitude columns that can be merged with the train/test sets 
    addresses = pd.read_csv( 'addresses.csv')
    latlons= pd.read_csv('latlons.csv')
    address_latlons = pd.merge( addresses, latlons, on = 'address')
    address_latlons = address_latlons[ ['ticket_id', 'lat','lon']]
    
    # Train Set 
    df_train = pd.read_csv( 'train.csv' , encoding="cp1252")
    df_train = df_train[df_train['ticket_issued_date'].between( '2004-01-01', '2011-12-31')]
    # Drop fields that are absent from the test set; they describe payment outcomes and would leak the target
    df_train.drop(labels= ['payment_amount','payment_date','payment_status','balance_due',
                           'collection_status','compliance_detail'], axis = 1, inplace = True )
    
    agency_name_labels = df_train['agency_name'].astype('category').cat.categories.tolist()
    agency_name_mapping = { 'agency_name':{ a:b for a,b in zip( agency_name_labels, range(1, len( agency_name_labels)+1) )}}
    
    disposition_level1_mapping = { 'disposition_level1': {'Not responsible by City Dismissal':1,
                                     'Not responsible by Determination':1,
                                     'Not responsible by Dismissal':1,
                                     'PENDING JUDGMENT':3,
                                     'Responsible (Fine Waived) by Admis':2,
                                     'Responsible (Fine Waived) by Deter':2,
                                     'Responsible - Compl/Adj by Default':2,
                                     'Responsible - Compl/Adj by Determi':2,
                                     'Responsible by Admission':2,
                                     'Responsible by Default':2,
                                     'Responsible by Determination':2,
                                     'Responsible by Dismissal':2,
                                     'SET-ASIDE (PENDING JUDGMENT)':3} }
    
    
    disposition_level2_mapping = { 'disposition_level2': {'Not responsible by City Dismissal':1,
                                     'Not responsible by Determination':2,
                                     'Not responsible by Dismissal':1,
                                     'PENDING JUDGMENT':3,
                                     'Responsible (Fine Waived) by Admis':4,
                                     'Responsible (Fine Waived) by Deter':5,
                                     'Responsible - Compl/Adj by Default':6,
                                     'Responsible - Compl/Adj by Determi':7,
                                     'Responsible by Admission':8,
                                     'Responsible by Default':6,
                                     'Responsible by Determination':7,
                                     'Responsible by Dismissal':9,
                                     'SET-ASIDE (PENDING JUDGMENT)':3} }
    df_train = df_train[pd.notnull(df_train['compliance'])]

    y_train = df_train['compliance']
#     y_train.fillna(2,inplace = True)
    df_train.drop('compliance', axis = 1, inplace = True)
    
    
    df_train = dftreat(df_train , agency_name_mapping, address_latlons, disposition_level1_mapping,disposition_level2_mapping)

    df_train_size = len(df_train)
    # Test Set 
    df_test = pd.read_csv( 'test.csv' , encoding="cp1252")
    
    df_test = dftreat(df_test , agency_name_mapping, address_latlons, disposition_level1_mapping,disposition_level2_mapping)
    
    df_test_ticketid = df_test['ticket_id']
    #combined dataset 
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df_alldata = pd.concat([df_train, df_test])
    
    # remove columns not needed - note - grafitti status has mostly nulls 
    df_alldata.drop(['zip_code','ticket_id', 'violation_description','ticket_issued_date', 'hearing_date','grafitti_status'], axis = 1, inplace = True)
    
    df_alldata.replace( {'hearing_date_year':{np.nan:0},'hearing_date_month':{np.nan:0}, 'hearing_date_dayofmonth':{np.nan:0},
                        'hearing_date_dayofweek':{np.nan:0},'hearing_date_dayofyear':{np.nan:0},
                         'hearing_date_hourofday':{np.nan:0}} , inplace = True)
    
 
    df_alldata = pd.get_dummies( df_alldata , columns = ['violation_code'] ,prefix = 'code' )
  
    # Fill any remaining missing feature values with 0
    df_alldata.fillna(0, inplace = True)
    
    
    X_train = df_alldata.iloc[:df_train_size,:]
    # Everything after the training rows is the test set (avoids a hard-coded row count)
    X_test = df_alldata.iloc[df_train_size:,:]

#     X_train = df_alldata.iloc[:1000,:]
#     X_test = df_alldata.iloc[-100:,:]
#     y_train = y_train.iloc[:1000]
    
    from sklearn.linear_model import LogisticRegression

    # Fit a logistic regression; column 1 of predict_proba is P(compliance = 1)
    lrclf = LogisticRegression()
    lrclf.fit( X_train, y_train )
    y_proba = lrclf.predict_proba(X_test)
    
    answer  = pd.Series(y_proba[:,1] , index = df_test_ticketid)

    return  answer 

print(blight_model())

print(timeit.timeit(blight_model, number=1))







ticket_id
284932    7.787504e-02
285362    3.075644e-02
285361    8.112787e-02
285338    6.578249e-02
285346    8.559653e-02
285345    6.486905e-02
285347    8.586825e-02
285342    1.857507e-01
285530    2.962304e-02
284989    4.992406e-02
285344    7.597584e-02
285343    2.952076e-02
285340    2.998811e-02
285341    7.712019e-02
285349    8.559181e-02
285348    6.486539e-02
284991    4.992462e-02
285532    4.935165e-02
285406    5.073905e-02
285001    4.860216e-02
285006    2.878466e-02
285405    2.670351e-02
285337    4.914333e-02
285496    7.484857e-02
285497    6.443802e-02
285378    3.053597e-02
285589    4.858626e-02
285585    6.409574e-02
285501    7.878969e-02
285581    2.965279e-02
285583    6.602232e-02
285372    3.100396e-02
285470    7.150606e-02
285475    7.150577e-02
285370    3.148540e-02
285503    7.599151e-02
285502    2.982304e-02
285411    4.793333e-02
285498    3.611716e-01
285414    4.794556e-02
285484    6.247356e-02
285499    7.598944e-02
285419    5.837941e-02
2



10.604039555588315
None
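
The model above concatenates the train and test frames before calling `pd.get_dummies`, so a violation code that appears in only one of the two sets still gets a column in both. A small sketch of that idea follows; the code strings are made up and not taken from the data.

```python
import pandas as pd

# Invented violation codes; "C-3" occurs only in the test frame.
train = pd.DataFrame({"violation_code": ["A-1", "B-2", "A-1"]})
test = pd.DataFrame({"violation_code": ["B-2", "C-3"]})
n_train = len(train)

# Encode both sets together so they share one set of dummy columns.
combined = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=["violation_code"], prefix="code")

# Split back by position: the first n_train rows are the training set.
X_train = encoded.iloc[:n_train]
X_test = encoded.iloc[n_train:]
print(list(encoded.columns))
```

Splitting by `n_train` rather than by a fixed row count keeps the slice correct if either file changes size.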


In [59]:
blight_model()

Unnamed: 0,ticket_id,lat,lon
0,22056,42.390729,-83.124268
1,77242,42.390729,-83.124268
2,77243,42.390729,-83.124268
3,103945,42.390729,-83.124268
4,138219,42.390729,-83.124268
5,177558,42.390729,-83.124268
6,27586,42.326937,-83.135118
7,22062,42.380516,-83.096069
8,22084,42.380570,-83.095919
9,61031,42.380570,-83.095919
