## SF Crime Kaggle Competition

The Kaggle competition [San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime) provides 11 years of incidents from the SFPD Crime Incident Reporting system. The goal is to predict the category of each crime from the other information in the dataset. Currently the highest score is 2.29303.
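
For context, this competition's leaderboard is scored with multi-class logarithmic loss, so lower is better. The sketch below computes that metric by hand on made-up probabilities. The function name `multiclass_log_loss`, the toy numbers, and the category count of 39 are illustrative assumptions, not part of the competition code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip so that a predicted probability of exactly 0 does not give log(0)
        p = min(max(row[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Three hypothetical incidents, three hypothetical classes
y_true = [0, 2, 1]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
toy_loss = multiclass_log_loss(y_true, probs)

# Baseline: guessing uniformly over k classes always scores ln(k).
k = 39  # assumed number of crime categories; check with Y.nunique()
uniform_loss = multiclass_log_loss([0], [[1.0 / k] * k])
print(toy_loss, uniform_loss)
```

A uniform guess gives a useful reference point: any model worth submitting should score below ln(k).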

#### My Goals
I want to use this dataset to get an idea of what submitting a model to a Kaggle competition looks like, and also to get a feeling for how far basic data analysis will get you in a competition. In terms of skill building, I want to be able to optimize my models and create some visualizations.


#### Data Exploration
As usual, let's start with some basic data exploration.

In [1]:
import pandas as pd
df = pd.read_csv('./data/train.csv')
df

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541
5,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122.403252,37.713431
6,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,INGLESIDE,NONE,AVALON AV / PERU AV,-122.423327,37.725138
7,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,BAYVIEW,NONE,KIRKWOOD AV / DONAHUE ST,-122.371274,37.727564
8,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,RICHMOND,NONE,600 Block of 47TH AV,-122.508194,37.776601
9,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,CENTRAL,NONE,JEFFERSON ST / LEAVENWORTH ST,-122.419088,37.807802


This time around the data is categorical, but there are only 8 independent variables and 1 dependent variable. My instinct tells me to still try Random Forest classification, but let's continue to explore the data a little bit first.

In [2]:
# Here I've created a DataFrame that groups by crime Category and then by Address.
# This returns the number of times each crime happened at each address.
crime_address = pd.DataFrame({'count': df.select_dtypes(include=['O']).groupby(['Category','Address']).size()}).reset_index()
crime_address

Unnamed: 0,Category,Address,count
0,ARSON,0 Block of 12TH ST,2
1,ARSON,0 Block of 14TH ST,1
2,ARSON,0 Block of 3RD ST,2
3,ARSON,0 Block of 4TH ST,1
4,ARSON,0 Block of 6TH ST,10
5,ARSON,0 Block of 9TH ST,2
6,ARSON,0 Block of ARKANSAS ST,1
7,ARSON,0 Block of BALDWIN CT,1
8,ARSON,0 Block of BAYVIEW ST,1
9,ARSON,0 Block of BEATRICE LN,1


In [3]:
# This reduces the information to show just the area with the highest frequency for each crime.
address_count = crime_address.groupby(['Category']).apply(lambda x: x[x['count']==x['count'].max()]).reset_index(1)

# Cleaning up the data, groupby doesn't always come out pretty
address_count.drop(address_count.columns[[0, 1]], axis=1, inplace=True)
address_count = address_count.reset_index()

# I don't want to see rare crimes with multiple addresses tied for first with 1 occurrence
address_count = address_count[address_count['count'] > 1]
address_count

Unnamed: 0,Category,Address,count
0,ARSON,800 Block of BRYANT ST,41
1,ASSAULT,800 Block of BRYANT ST,1926
2,BAD CHECKS,800 Block of BRYANT ST,12
3,BRIBERY,800 Block of BRYANT ST,12
4,BURGLARY,800 Block of BRYANT ST,384
5,DISORDERLY CONDUCT,1000 Block of POTRERO AV,104
6,DRIVING UNDER THE INFLUENCE,800 Block of BRYANT ST,41
7,DRUG/NARCOTIC,2000 Block of MISSION ST,1866
8,DRUNKENNESS,800 Block of BRYANT ST,100
9,EMBEZZLEMENT,800 Block of MARKET ST,42


In [4]:
# For each address, the number of crime categories in which it has the highest frequency
top_address = address_count.groupby('Address').size().reset_index(0)
top_address

Unnamed: 0,Address,0
0,100 Block of GGBRIDGE HY,1
1,1000 Block of POTRERO AV,3
2,1200 Block of PAGE ST,1
3,1400 Block of PHELPS ST,1
4,1500 Block of BAY SHORE BL,1
5,200 Block of INTERSTATE80 HY,1
6,2000 Block of MISSION ST,1
7,700 Block of KEARNY ST,1
8,800 Block of 3RD ST,1
9,800 Block of BRYANT ST,23


#### Address Break Down
The amount of crime is overwhelmingly clustered around one address. The 800 Block of Bryant St is the most common address for 23 types of crime, while the second-place address is the top for only 3. At first glance, many of the crimes that occur most often on Bryant St are violent crimes or serious property crimes.

Since the crimes are clustered around one address, using the correlation between address and crime type probably won't be too useful.
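
One way to put a number on how concentrated each category is at its top address is to compute that address's share of the category's incidents. This is a minimal sketch on a made-up frame; the `toy` data and the `share` column are hypothetical, and unlike the cells above it keeps only one address per category rather than all tied addresses.

```python
import pandas as pd

# Hypothetical stand-in for df[['Category', 'Address']]
toy = pd.DataFrame({
    'Category': ['ARSON', 'ARSON', 'ARSON',
                 'ASSAULT', 'ASSAULT', 'ASSAULT', 'ASSAULT'],
    'Address': ['A ST', 'A ST', 'B ST',
                'A ST', 'A ST', 'A ST', 'C ST'],
})

# Incidents per (Category, Address) pair
counts = toy.groupby(['Category', 'Address']).size().reset_index(name='count')

# Keep the single most frequent address for each category
top = (counts.sort_values('count', ascending=False)
             .drop_duplicates('Category')
             .set_index('Category'))

# Share of each category's incidents that fall on its top address
totals = counts.groupby('Category')['count'].sum()
top['share'] = top['count'] / totals
print(top)
```

If the shares are small even for the top address, the address alone says little about the category, which matches the intuition above.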

Let's try looking at the crime descriptions.

In [5]:
crime_desc = df.groupby(['Category','Descript']).size().reset_index()
crime_desc

Unnamed: 0,Category,Descript,0
0,ARSON,ARSON,447
1,ARSON,ARSON OF A COMMERCIAL BUILDING,65
2,ARSON,ARSON OF A POLICE BUILDING,2
3,ARSON,ARSON OF A POLICE VEHICLE,1
4,ARSON,ARSON OF A VACANT BUILDING,33
5,ARSON,ARSON OF A VEHICLE,607
6,ARSON,ARSON OF AN INHABITED DWELLING,217
7,ARSON,ATTEMPTED ARSON,109
8,ARSON,"FIRE, UNLAWFULLY CAUSING",32
9,ASSAULT,"AGGRAVATED ASSAULT OF POLICE OFFICER, SNIPING",1


After looking at these results, it looks like the description is a strong indicator of the category of the crime. The issue is that I need to get my model to see the similarities between the different descriptions.

#### Oops...
Looks like the `Descript` data is provided only in the training data :/ . Well, with that, let me move on to some modeling.
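
A quick check like the one below would have caught this before any exploration. It is only a sketch: the training columns are copied from the table at the top, while the test column list is an assumption inferred from how the test file is handled later in this notebook.

```python
# Columns as shown in the training data above
train_columns = ['Dates', 'Category', 'Descript', 'DayOfWeek',
                 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']

# Assumed columns of test.csv. In practice, read them from the file header:
#   pd.read_csv('./data/test.csv', nrows=0).columns
test_columns = ['Id', 'Dates', 'DayOfWeek', 'PdDistrict', 'Address', 'X', 'Y']

# Columns that cannot be used as features because the test set lacks them
train_only = [c for c in train_columns if c not in test_columns]
print(train_only)
```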

In [6]:
# We'll be starting off with a Random Forest classifier
import sklearn.ensemble as sk
import numpy as np
rfc = sk.RandomForestClassifier(oob_score = True, n_estimators = 50, warm_start=True)

# Encode the target categories as integer labels
Y = df['Category'].astype('category')
Y = Y.cat.rename_categories(range(0,Y.nunique()))

# Drop the target and the columns that won't be used as features
X =  df.drop(['Category','Descript','Resolution'], axis=1)

# Integer-encode the first four columns (Dates, DayOfWeek, PdDistrict, Address)
for col in X.columns[:4]:
    X[col] = X[col].astype('category')
    X[col] = X[col].cat.rename_categories(range(0,X[col].nunique()))
X

Unnamed: 0,Dates,DayOfWeek,PdDistrict,Address,X,Y
0,389256,6,4,19790,-122.425892,37.774599
1,389256,6,4,19790,-122.425892,37.774599
2,389255,6,4,22697,-122.424363,37.800414
3,389254,6,4,4266,-122.426995,37.800873
4,389254,6,5,1843,-122.438738,37.771541
5,389254,6,2,1505,-122.403252,37.713431
6,389254,6,2,13322,-122.423327,37.725138
7,389254,6,0,18054,-122.371274,37.727564
8,389253,6,6,11384,-122.508194,37.776601
9,389253,6,1,17658,-122.419088,37.807802


In [7]:
# %time model = rfc.fit(X.loc[0:350000],Y.loc[0:350000])
# model.oob_score_

Pretty much, my Random Forest sucks. I'll try making additional features and see how things turn out.

In [8]:
# Breaking down Time stamp into Year, Month, and Day
X['Dates'] = pd.to_datetime(df['Dates'])
X['Year'] = X.Dates.dt.year
X['Month'] = X.Dates.dt.month
X['Day'] = X.Dates.dt.day
X.drop('Dates', axis=1, inplace=True)
X



Unnamed: 0,DayOfWeek,PdDistrict,Address,X,Y,Year,Month,Day
0,6,4,19790,-122.425892,37.774599,2015,5,13
1,6,4,19790,-122.425892,37.774599,2015,5,13
2,6,4,22697,-122.424363,37.800414,2015,5,13
3,6,4,4266,-122.426995,37.800873,2015,5,13
4,6,5,1843,-122.438738,37.771541,2015,5,13
5,6,2,1505,-122.403252,37.713431,2015,5,13
6,6,2,13322,-122.423327,37.725138,2015,5,13
7,6,0,18054,-122.371274,37.727564,2015,5,13
8,6,6,11384,-122.508194,37.776601,2015,5,13
9,6,1,17658,-122.419088,37.807802,2015,5,13
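
The breakdown above stops at the day, so the time of day in `Dates` is discarded. As a possible extension (not something this notebook tried), the hour and minute can be pulled out with the same `.dt` accessor. The `stamps` values below are illustrative; the first two come from the training table and the third is made up.

```python
import pandas as pd

# Illustrative timestamps in the same format as the Dates column
stamps = pd.Series(pd.to_datetime([
    '2015-05-13 23:53:00',
    '2015-05-13 23:33:00',
    '2015-05-13 08:05:00',  # hypothetical morning incident
]))

# Time-of-day features alongside Year, Month, and Day
time_features = pd.DataFrame({
    'Hour': stamps.dt.hour,
    'Minute': stamps.dt.minute,
})
print(time_features)
```

Whether these columns help is an open question here; they would need the same before-and-after score comparison as the other features.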


In [9]:
# RFC Round 2
n = 250
clf = sk.RandomForestClassifier(oob_score = True,n_jobs=-1, n_estimators = n)

# %time clf.fit(X[::10],Y[::10])



In [10]:
clf.oob_score_

AttributeError: 'RandomForestClassifier' object has no attribute 'oob_score_'
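
The `AttributeError` above is what scikit-learn raises when `oob_score_` is read before `fit` has run, which is the case here because the `fit` line in the previous cell is commented out. A minimal sketch on synthetic data (the `X_demo`, `y_demo`, and `clf_demo` names are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem: the label depends only on the first feature
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 4)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

clf_demo = RandomForestClassifier(n_estimators=50, oob_score=True,
                                  random_state=0)

# The attribute does not exist until the forest has been fitted
has_oob_before = hasattr(clf_demo, 'oob_score_')

clf_demo.fit(X_demo, y_demo)

# After fitting, oob_score_ holds the out-of-bag accuracy estimate
has_oob_after = hasattr(clf_demo, 'oob_score_')
print(has_oob_before, has_oob_after, clf_demo.oob_score_)
```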

In [None]:
# model.score(X,Y)
# 0.24406610565014025 (depth 7, 250 estimators)
# 0.243709633517      (depth 7, 500 estimators)
# 0.244111661194      (depth 7, 750 estimators)
# 0.243918050132      (depth 7, 1000 estimators)
# 0.29118306609312239 (depth 15, 250 estimators)


I spent the week trying to get the Random Forest to output a decent oob_score_ but came up empty. I tried principal component analysis, breaking the model apart into small pieces, and different settings. The reason I kept trying for so long is that I couldn't figure out whether the problem was the model itself or whether my computer's memory limitations were severely inhibiting it. Either way it doesn't matter; it's time to move on and try something else.

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

In [None]:
model = gnb.fit(X,Y)
y_pred = model.predict(X)
y_pred

In [None]:
model.score(X,Y)

Naive Bayes also returned a poor result.

In [9]:
from sklearn.svm import SVC
vcl = SVC()
# %time sv_model = vcl.fit(X[::100],Y[::100]) #29.6s



In [None]:
# %time sv_model.score(X[::10],Y[::10]) # 1min 20s .278
%time sv_model.score(X[::100],Y[::100]) #8.12s .976

In [31]:
# Draw a random 2% sample of row labels (the sample size must be an integer)
rows = np.random.choice(X.index.values, int(X.shape[0] * .02))
Z = X.loc[rows]
T = Y.loc[rows]

SVM actually shows some promise when it is fit and scored on every 100th entry, but once I score it on every 10th entry the score drops back down to 27%. Let's see what else I can do to clean the data and improve the score.

In [37]:
%time vcl.fit(Z,T)

Wall time: 1min 46s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [40]:
%time vcl.score(Z,T)

Wall time: 31.8 s


0.22004555808656037

| Sample Size | Columns | Fit Time | Score Time | Score | Comments and Notes                                          |
|-------------|---------|----------|------------|-------|-------------------------------------------------------------|
| 2%          | 8       | 2min 39s | 36.9s      | 95.9% | Model scored on the sample; score may be due to overfitting |
| 2%          | 8       | 1min 46s | 31.8s      | 96.1% | Results of the model when run right after .fit()            |
| 2%          | 8       | 1min 46s | 31.8s      | 22.0% | Model used on a different sample                            |
In [41]:
test = pd.read_csv('./data/test.csv')
test.drop('Id',axis=1, inplace=True)

# Break the timestamp into Year, Month, and Day, as was done for the training data
test['Dates'] = pd.to_datetime(test['Dates'])
test['Year'] = test.Dates.dt.year
test['Month'] = test.Dates.dt.month
test['Day'] = test.Dates.dt.day
test.drop('Dates', axis=1, inplace=True)

# Integer-encode the first three remaining columns
# (expected to be DayOfWeek, PdDistrict, and Address)
for col in test.columns[:3]:
    test[col] = test[col].astype('category')
    test[col] = test[col].cat.rename_categories(range(0,test[col].nunique()))
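
One caveat with the cell above: the test columns are encoded independently of the training columns, so the same district or address can end up with a different integer in each frame. A possible fix, sketched here on made-up data, is to build the category list once from the training data and reuse it for both frames. The `train_demo` and `test_demo` frames are hypothetical.

```python
import pandas as pd

# Hypothetical miniature train and test columns
train_demo = pd.DataFrame({'PdDistrict': ['NORTHERN', 'PARK', 'BAYVIEW']})
test_demo = pd.DataFrame({'PdDistrict': ['PARK', 'NORTHERN', 'TARAVAL']})

# Independent encoding: codes depend on which values each frame contains
independent_codes = test_demo['PdDistrict'].astype('category').cat.codes

# Shared encoding: define the categories once, from the training data
shared_dtype = pd.CategoricalDtype(
    categories=sorted(train_demo['PdDistrict'].unique()))
train_codes = train_demo['PdDistrict'].astype(shared_dtype).cat.codes
test_codes = test_demo['PdDistrict'].astype(shared_dtype).cat.codes

# PARK is 2 in both frames with the shared dtype, but 1 when the test frame
# is encoded on its own; values unseen in training (TARAVAL) become -1
print(train_codes.tolist(), test_codes.tolist(), independent_codes.tolist())
```

Values that only appear in the test data show up as -1 and need an explicit decision, for example mapping them to a dedicated "unknown" code.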