# Challenge: predicting the resolvability of crimes for BATMAN


### Socio Team or "Bat" Team

<i>Goukam, Imad, Mohamed, Nassim (M2 AIC Paris_Sud)</i>

## Introduction
The data set was downloaded from this URL: <a>https://data.sfgov.org/Public-Safety/SFPD-Incidents-Current-Year-2015-/ritf-b9ki</a>.
The data has only a few columns:

* IncidntNum - incident number
* Category - category of the crime incident
* Descript - detailed description of the crime incident
* DayOfWeek - the day of the week
* Date - date of the crime incident
* Time - time of the crime incident
* PdDistrict - name of the Police Department District
* Resolution - how the crime incident was resolved. This is the target variable to predict.
* Address - the approximate street address of the crime incident
* X - Longitude
* Y - Latitude
* Location - the (latitude, longitude) pair




**The goal is to predict the <code>Resolution</code> column. The prediction quality is measured by RMSE.**
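
Because `Resolution` is encoded as 0/1 further down, RMSE here compares predicted scores with binary labels. The following is a minimal sketch, assuming predictions are numbers between 0 and 1; the function name `rmse` and the sample values are illustrative and not part of the challenge material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical labels and predicted scores, for illustration only
y_true = [1, 0, 1, 0]
y_pred = [0.5, 0.5, 0.5, 0.5]
score = rmse(y_true, y_pred)
print(score)
```

A constant prediction of 0.5 is off by 0.5 on every row, which gives a simple reference point when reading later scores.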



In [164]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

## Fetch the data and load it in pandas

In [165]:
import os

In [166]:
local_filename = 'data/original_data.csv'
data = pd.read_csv(local_filename)

In [167]:
data.shape

(153618, 12)

In [168]:
data.columns.values


array(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time',
       'PdDistrict', 'Resolution', 'Address', 'X', 'Y', 'Location'], dtype=object)

In [169]:
data.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,SOUTHERN,"ARREST, BOOKED",800 Block of BRYANT ST,-122.403405,37.775421,"(37.775420706711, -122.403404791479)"
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,SOUTHERN,NONE,1100 Block of MISSION ST,-122.411626,37.77859,"(37.7785895740312, -122.411626152299)"
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,INGLESIDE,"ARREST, BOOKED",0 Block of LELAND AV,-122.404263,37.711339,"(37.7113387848327, -122.404262861765)"
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,INGLESIDE,NONE,200 Block of BLYTHDALE AV,-122.420557,37.710895,"(37.7108945814914, -122.420556751442)"
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,CENTRAL,NONE,800 Block of POST ST,-122.415844,37.787402,"(37.7874017655636, -122.41584375719)"


In [170]:
# Date is still a string column here, so min/max compare text
print(min(data['Date']))
print(max(data['Date']))

01/01/2015
12/31/2015
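
`Date` holds strings at this point, so `min` and `max` compare text. The result above is correct only because every row falls in 2015 and the format is zero-padded MM/DD/YYYY. A sketch of parsing the column first, using a small made-up frame in place of `data`:

```python
import pandas as pd

# Made-up rows in the same MM/DD/YYYY format as the Date column
sample = pd.DataFrame({"Date": ["12/31/2015", "01/01/2016", "06/15/2015"]})

# String comparison: "01/01/2016" sorts first although it is the latest date
string_min = min(sample["Date"])
string_max = max(sample["Date"])
print(string_min, string_max)

# Parse with an explicit format, then compare real timestamps
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y")
print(parsed.min().date(), parsed.max().date())
```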


In [171]:
data['PdDistrict'].unique()

array(['SOUTHERN', 'INGLESIDE', 'CENTRAL', 'BAYVIEW', 'PARK', 'NORTHERN',
       'MISSION', 'TENDERLOIN', 'TARAVAL', 'RICHMOND'], dtype=object)

In [172]:
data['Category'].unique()

array(['ASSAULT', 'VANDALISM', 'OTHER OFFENSES', 'NON-CRIMINAL',
       'LARCENY/THEFT', 'VEHICLE THEFT', 'BURGLARY', 'ROBBERY', 'WARRANTS',
       'SUSPICIOUS OCC', 'WEAPON LAWS', 'DRUNKENNESS', 'TRESPASS',
       'FORGERY/COUNTERFEITING', 'DRUG/NARCOTIC', 'MISSING PERSON',
       'SECONDARY CODES', 'FRAUD', 'EMBEZZLEMENT',
       'SEX OFFENSES, FORCIBLE', 'BRIBERY', 'STOLEN PROPERTY',
       'DISORDERLY CONDUCT', 'ARSON', 'FAMILY OFFENSES', 'RUNAWAY',
       'DRIVING UNDER THE INFLUENCE', 'KIDNAPPING', 'PROSTITUTION',
       'SUICIDE', 'LIQUOR LAWS', 'EXTORTION', 'GAMBLING', 'BAD CHECKS',
       'SEX OFFENSES, NON FORCIBLE', 'LOITERING',
       'PORNOGRAPHY/OBSCENE MAT', 'TREA'], dtype=object)

In [173]:
data.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,SOUTHERN,"ARREST, BOOKED",800 Block of BRYANT ST,-122.403405,37.775421,"(37.775420706711, -122.403404791479)"
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,SOUTHERN,NONE,1100 Block of MISSION ST,-122.411626,37.77859,"(37.7785895740312, -122.411626152299)"
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,INGLESIDE,"ARREST, BOOKED",0 Block of LELAND AV,-122.404263,37.711339,"(37.7113387848327, -122.404262861765)"
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,INGLESIDE,NONE,200 Block of BLYTHDALE AV,-122.420557,37.710895,"(37.7108945814914, -122.420556751442)"
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,CENTRAL,NONE,800 Block of POST ST,-122.415844,37.787402,"(37.7874017655636, -122.41584375719)"


In [174]:
data['Resolution'].unique()

array(['ARREST, BOOKED', 'NONE', 'UNFOUNDED', 'ARREST, CITED',
       'JUVENILE BOOKED', 'EXCEPTIONAL CLEARANCE', 'PSYCHOPATHIC CASE',
       'NOT PROSECUTED', 'LOCATED',
       'CLEARED-CONTACT JUVENILE FOR MORE INFO', 'JUVENILE DIVERTED',
       'JUVENILE CITED', 'COMPLAINANT REFUSES TO PROSECUTE',
       'JUVENILE ADMONISHED'], dtype=object)

In [175]:
# Binarize the target: arrests (booked or cited) become 1, every other resolution becomes 0
data = data.replace(to_replace="ARREST, BOOKED", value=1)
data = data.replace(to_replace="ARREST, CITED", value=1)
data = data.replace(to_replace="NONE", value=0)
data = data.replace(to_replace="UNFOUNDED", value=0)
data = data.replace(to_replace="JUVENILE BOOKED", value=0)
data = data.replace(to_replace="EXCEPTIONAL CLEARANCE", value=0)
data = data.replace(to_replace="PSYCHOPATHIC CASE", value=0)
data = data.replace(to_replace="NOT PROSECUTED", value=0)
data = data.replace(to_replace="LOCATED", value=0)
data = data.replace(to_replace="CLEARED-CONTACT JUVENILE FOR MORE INFO", value=0)
data = data.replace(to_replace="COMPLAINANT REFUSES TO PROSECUTE", value=0)
data = data.replace(to_replace="JUVENILE ADMONISHED", value=0)
data = data.replace(to_replace="JUVENILE DIVERTED", value=0)
data = data.replace(to_replace="JUVENILE CITED", value=0)
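
The same binarization can be restricted to the `Resolution` column, so that a matching string in another column is left alone. This is a sketch on a made-up frame; the name `ARREST_LABELS` and the `"NONE"` value placed in `Descript` are introduced here for illustration only.

```python
import pandas as pd

# Resolutions mapped to 1 in the cell above
ARREST_LABELS = ["ARREST, BOOKED", "ARREST, CITED"]

sample = pd.DataFrame({
    # Hypothetical "NONE" description: it should stay text
    "Descript": ["BATTERY", "LOST PROPERTY", "NONE"],
    "Resolution": ["ARREST, BOOKED", "NONE", "ARREST, CITED"],
})

# 1 for the two arrest outcomes, 0 for everything else
sample["Resolution"] = sample["Resolution"].isin(ARREST_LABELS).astype(int)
print(sample)
```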


***Remove "Address" and "Location" to simplify***

In [176]:
# Pass axis by keyword; recent pandas versions no longer accept it positionally
data = data.drop(['Address', 'Location'], axis=1)


In [177]:
data.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,X,Y
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,SOUTHERN,1,-122.403405,37.775421
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,SOUTHERN,0,-122.411626,37.77859
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,INGLESIDE,1,-122.404263,37.711339
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,INGLESIDE,0,-122.420557,37.710895
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,CENTRAL,0,-122.415844,37.787402


In [178]:
data.dtypes

IncidntNum      int64
Category       object
Descript       object
DayOfWeek      object
Date           object
Time           object
PdDistrict     object
Resolution      int64
X             float64
Y             float64
dtype: object

***Replace the San Francisco PdDistrict names with Gotham City districts*** (<a>https://en.wikipedia.org/wiki/Gotham_City</a>)

In [179]:
data['PdDistrict'].unique()

array(['SOUTHERN', 'INGLESIDE', 'CENTRAL', 'BAYVIEW', 'PARK', 'NORTHERN',
       'MISSION', 'TENDERLOIN', 'TARAVAL', 'RICHMOND'], dtype=object)

In [180]:
data = data.replace(to_replace="SOUTHERN", value="Otisburg")
data = data.replace(to_replace="INGLESIDE", value="Burnley")
data = data.replace(to_replace="CENTRAL", value="East End")
data = data.replace(to_replace="BAYVIEW", value="Old Gotham")
data = data.replace(to_replace="PARK", value="Robinson Park")
data = data.replace(to_replace="NORTHERN", value="Chinatown")
data = data.replace(to_replace="MISSION", value="Bristol County")
data = data.replace(to_replace="TENDERLOIN", value="The Bowery")
data = data.replace(to_replace="TARAVAL", value="Diamond")
data = data.replace(to_replace="RICHMOND", value="Falcone Penthouse")
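
The ten replacements can also be written as one dictionary applied to `PdDistrict` only. This is a sketch on a made-up frame; `GOTHAM_DISTRICTS` is a name introduced here and lists only three of the pairs used above.

```python
import pandas as pd

# Subset of the district renaming used in the cell above
GOTHAM_DISTRICTS = {
    "SOUTHERN": "Otisburg",
    "INGLESIDE": "Burnley",
    "CENTRAL": "East End",
}

sample = pd.DataFrame({"PdDistrict": ["SOUTHERN", "CENTRAL", "INGLESIDE", "SOUTHERN"]})

# Series.replace with a dict leaves values that have no key unchanged
sample["PdDistrict"] = sample["PdDistrict"].replace(GOTHAM_DISTRICTS)
renamed = sample["PdDistrict"].tolist()
print(renamed)
```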

In [181]:
data.head()


Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,X,Y
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,Otisburg,1,-122.403405,37.775421
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,Otisburg,0,-122.411626,37.77859
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,Burnley,1,-122.404263,37.711339
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,Burnley,0,-122.420557,37.710895
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,East End,0,-122.415844,37.787402


# Save the new data set

In [183]:
data.to_csv('data/data_set.csv',index=False)

In [184]:
data = pd.read_csv('data/data_set.csv')
data.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,X,Y
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,Otisburg,1,-122.403405,37.775421
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,Otisburg,0,-122.411626,37.77859
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,Burnley,1,-122.404263,37.711339
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,Burnley,0,-122.420557,37.710895
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,East End,0,-122.415844,37.787402


### Split data


In [199]:
# sklearn.cross_validation was removed in scikit-learn 0.20; ShuffleSplit lives in model_selection
from sklearn.model_selection import ShuffleSplit
data = pd.read_csv('data/data_set.csv')
data.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,X,Y
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,Otisburg,1,-122.403405,37.775421
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,Otisburg,0,-122.411626,37.77859
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,Burnley,1,-122.404263,37.711339
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,Burnley,0,-122.420557,37.710895
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,East End,0,-122.415844,37.787402


In [186]:
#data_Train, data2  = np.array_split(data,2)
#data_Test, data_Valid = np.array_split(data2,2)

In [187]:
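# Integer division (//) keeps the slice bounds integers
train = data[0:153618//3]
# The +1 offset skips row 51206 and the -1 bound leaves out the last row
test_val = data[153618//3 + 1:153618 - 1]

In [188]:
valid = test_val[0:2*len(test_val)//3]
# The +1 offset skips one row between the validation and test sets
test = test_val[2*len(test_val)//3 + 1:len(test_val)]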

In [189]:
# These are fractions of the full data set, not percentages
print("fraction in train: %.4f" % (len(train) / 153618.0))
print("fraction in validation set: %.4f" % (len(valid) / 153618.0))
print("fraction in test set: %.4f" % (len(test) / 153618.0))

fraction in train: 0.3333
fraction in validation set: 0.4444
fraction in test set: 0.2222
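
The slices above are contiguous, and the rows appear to be sorted from the most recent incident backwards, so each subset covers a different part of the year. The `+1` offsets also leave a few rows unused. The following is a sketch of a shuffled split that uses every row while keeping roughly the same 1/3, 4/9, 2/9 proportions; `split_indices` and the seed value are assumptions made for this example.

```python
import numpy as np

def split_indices(n_rows, seed=0):
    """Return shuffled, non-overlapping train/valid/test row positions."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_rows)
    n_train = n_rows // 3
    # Two thirds of the remaining rows go to validation, the rest to test
    n_valid = 2 * (n_rows - n_train) // 3
    train_idx = order[:n_train]
    valid_idx = order[n_train:n_train + n_valid]
    test_idx = order[n_train + n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = split_indices(153618)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The positions can then be applied with `data.iloc[train_idx]`, and likewise for the other two subsets.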


In [190]:
data_Train = train
data_Valid = valid
data_Test = test

# Train

In [191]:
y_train = data_Train.Resolution
#x_train = data_Train.drop(["Resolution"],axis = 1)

In [192]:
#y_train.to_csv('data/ref_train.solution',)

In [202]:
data_Train.to_csv('data/x_train.data',index=False)

In [194]:
train.head()

Unnamed: 0,IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,X,Y
0,160000108,ASSAULT,BATTERY,Thursday,12/31/2015,23:58,Otisburg,1,-122.403405,37.775421
1,166004914,VANDALISM,"MALICIOUS MISCHIEF, VANDALISM",Thursday,12/31/2015,23:55,Otisburg,0,-122.411626,37.77859
2,160000095,ASSAULT,INFLICT INJURY ON COHABITEE,Thursday,12/31/2015,23:54,Burnley,1,-122.404263,37.711339
3,160038137,OTHER OFFENSES,VIOLATION OF RESTRAINING ORDER,Thursday,12/31/2015,23:51,Burnley,0,-122.420557,37.710895
4,166002930,NON-CRIMINAL,LOST PROPERTY,Thursday,12/31/2015,23:50,East End,0,-122.415844,37.787402


# Test

In [203]:
y_test = data_Test.Resolution
x_test = data_Test.drop(["Resolution"],axis = 1)

In [204]:
y_test.to_csv('data/ref_test.solution',index=False)

In [205]:
x_test.to_csv('data/x_test.data',index=False)

In [206]:
y_test.head()

119481    0
119482    0
119483    1
119484    0
119485    0
Name: Resolution, dtype: int64

# Validation

In [207]:
y_valid = data_Valid.Resolution
x_valid = data_Valid.drop(["Resolution"],axis = 1)

In [208]:
y_valid.to_csv('data/ref_valid.solution',index=False)

In [209]:
x_valid.to_csv('data/x_valid.data',index=False)
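
Whether `Series.to_csv` writes a header line by default has differed between pandas versions, so a program reading the `.solution` files may find one more line than expected. The sketch below makes the choice explicit. It writes to an in-memory buffer instead of `data/`, and the sample labels are made up.

```python
import io
import pandas as pd

y = pd.Series([0, 0, 1], name="Resolution")

# header=False writes one label per line and nothing else
buf = io.StringIO()
y.to_csv(buf, index=False, header=False)
lines = buf.getvalue().splitlines()
print(lines)

# header=True adds the column name as the first line
buf_h = io.StringIO()
y.to_csv(buf_h, index=False, header=True)
lines_with_header = buf_h.getvalue().splitlines()
print(lines_with_header)
```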