## SF Crime Kaggle Competition

The Kaggle competition [San Francisco Crime Classification](https://www.kaggle.com/c/sf-crime) provides 11 years of incidents from the SFPD Crime Incident Reporting system. The goal is to predict the category of each crime from the other information in the dataset. Currently the top score on the leaderboard is 2.29303.
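
For context on that score: as far as I can tell, the competition is scored with multi-class logarithmic loss, where lower is better. Below is a minimal sketch of the metric using made-up probabilities. The helper `multiclass_log_loss` is a name introduced for illustration and is not part of the notebook.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of integer class labels.
    probs:  list of per-row probability lists (one entry per class).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a prediction is confidently wrong.
        p = min(max(row[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# A model that guesses uniformly over K classes scores ln(K),
# which gives a rough "no information" baseline for any K.
K = 4
uniform = [[1.0 / K] * K for _ in range(3)]
baseline = multiclass_log_loss([0, 2, 3], uniform)
```

Any model worth submitting should land below the uniform baseline for the number of categories in the data.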

#### My Goals
I want to use this dataset to get an idea of what submitting a model to a Kaggle competition looks like, and to get a feeling for how far basic data analysis will get you in a competition. In terms of skill building, I want to practice optimizing my models and creating some visualizations.


#### Data Exploration
As usual, let's start with some basic data exploration.

In [1]:
import pandas as pd
df = pd.read_csv('./data/train.csv')
df

Unnamed: 0,Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y
0,2015-05-13 23:53:00,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,2015-05-13 23:53:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,2015-05-13 23:33:00,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541
5,2015-05-13 23:30:00,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122.403252,37.713431
6,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,INGLESIDE,NONE,AVALON AV / PERU AV,-122.423327,37.725138
7,2015-05-13 23:30:00,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,BAYVIEW,NONE,KIRKWOOD AV / DONAHUE ST,-122.371274,37.727564
8,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,RICHMOND,NONE,600 Block of 47TH AV,-122.508194,37.776601
9,2015-05-13 23:00:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,CENTRAL,NONE,JEFFERSON ST / LEAVENWORTH ST,-122.419088,37.807802


This time around the data is categorical, but there are only 8 independent variables and 1 dependent variable (`Category`). My instinct is still to try Random Forest classification, but let's continue to explore the data a little first.

In [2]:
# Count how many times each category of crime was reported at each address.
crime_address = df.groupby(['Category', 'Address']).size().reset_index(name='count')
crime_address

Unnamed: 0,Category,Address,count
0,ARSON,0 Block of 12TH ST,2
1,ARSON,0 Block of 14TH ST,1
2,ARSON,0 Block of 3RD ST,2
3,ARSON,0 Block of 4TH ST,1
4,ARSON,0 Block of 6TH ST,10
5,ARSON,0 Block of 9TH ST,2
6,ARSON,0 Block of ARKANSAS ST,1
7,ARSON,0 Block of BALDWIN CT,1
8,ARSON,0 Block of BAYVIEW ST,1
9,ARSON,0 Block of BEATRICE LN,1


In [3]:
# Keep only the address(es) with the highest count for each category of crime.
is_top = crime_address['count'] == crime_address.groupby('Category')['count'].transform('max')
address_count = crime_address[is_top].reset_index(drop=True)

# Drop rare crimes where several addresses tie for first place with a single occurrence.
address_count = address_count[address_count['count'] > 1]
address_count

Unnamed: 0,Category,Address,count
0,ARSON,800 Block of BRYANT ST,41
1,ASSAULT,800 Block of BRYANT ST,1926
2,BAD CHECKS,800 Block of BRYANT ST,12
3,BRIBERY,800 Block of BRYANT ST,12
4,BURGLARY,800 Block of BRYANT ST,384
5,DISORDERLY CONDUCT,1000 Block of POTRERO AV,104
6,DRIVING UNDER THE INFLUENCE,800 Block of BRYANT ST,41
7,DRUG/NARCOTIC,2000 Block of MISSION ST,1866
8,DRUNKENNESS,800 Block of BRYANT ST,100
9,EMBEZZLEMENT,800 Block of MARKET ST,42


In [4]:
# For each address, the number of crime categories in which it is the most frequent location
top_address = address_count.groupby('Address').size().reset_index(0)
top_address

Unnamed: 0,Address,0
0,100 Block of GGBRIDGE HY,1
1,1000 Block of POTRERO AV,3
2,1200 Block of PAGE ST,1
3,1400 Block of PHELPS ST,1
4,1500 Block of BAY SHORE BL,1
5,200 Block of INTERSTATE80 HY,1
6,2000 Block of MISSION ST,1
7,700 Block of KEARNY ST,1
8,800 Block of 3RD ST,1
9,800 Block of BRYANT ST,23


#### Address Breakdown
Crime is overwhelmingly clustered around one address. The 800 Block of Bryant St is the most common address for 23 types of crime, while the second-place address is the top for only 3. At first glance, many of the crimes that occur most often on Bryant St are violent crimes or serious property crimes.

Since so many categories share the same top address, that address does little to tell one crime type from another, so the correlation between address and crime type probably won't be too useful.
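
One way to put a number on that clustering is to look at what share of each category's incidents falls on its single most common address. The sketch below uses a tiny made-up frame with the same `Category` and `Address` columns; `top_share` is a name introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the training data (values are invented).
toy = pd.DataFrame({
    'Category': ['ARSON', 'ARSON', 'ARSON', 'ASSAULT', 'ASSAULT', 'ASSAULT', 'ASSAULT'],
    'Address':  ['A ST',  'A ST',  'B ST',  'A ST',    'C ST',    'C ST',    'C ST'],
})

# Incidents per (Category, Address) pair.
counts = toy.groupby(['Category', 'Address']).size()

# Largest single-address count divided by the category total.
per_category = counts.groupby(level='Category')
top_share = per_category.max() / per_category.sum()
```

A share close to 1 would mean the top address accounts for nearly all of a category; a small share means it is merely the largest of many locations.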

I'll stop here for now and instead start preparing the data for some black box modeling.

In [44]:
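# Encode the target as integer codes 0..n_classes-1.
Y = df['Category'].astype('category')
Y = Y.cat.rename_categories(range(Y.nunique()))

# Drop the target and the two columns not used as features.
X = df.drop(['Category', 'Descript', 'Resolution'], axis=1)

# Encode the first four columns (Dates, DayOfWeek, PdDistrict, Address) as integer codes;
# the X and Y coordinates are already numeric.
for col in X.columns[:4]:
    X[col] = X[col].astype('category')
    X[col] = X[col].cat.rename_categories(range(X[col].nunique()))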
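
The integer codes are only useful if they can be mapped back to category names later, for example when labelling prediction columns. Here is a small sketch of the same encoding idea using `cat.codes` on made-up labels; `code_to_name` is a hypothetical lookup dictionary, not something the notebook defines.

```python
import pandas as pd

labels = pd.Series(['WARRANTS', 'ARSON', 'LARCENY/THEFT', 'ARSON']).astype('category')

# Categories are sorted alphabetically by default, so codes follow that order.
codes = labels.cat.codes

# Keep the lookup table so integer predictions can be turned back into names.
code_to_name = dict(enumerate(labels.cat.categories))
```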

My Random Forest pretty much sucks, so I'll try making some additional features and see how things turn out.
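
The fitting step itself isn't shown here, so the following is only a sketch of how a Random Forest could be scored with log loss on a held-out split. It runs on synthetic data from scikit-learn's `make_classification`, and the hyperparameters are placeholders rather than the values used for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for the crime features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Placeholder hyperparameters, chosen only so the example runs quickly.
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_tr, y_tr)

# Log loss needs class probabilities, not hard predictions.
val_probs = model.predict_proba(X_val)
val_loss = log_loss(y_val, val_probs, labels=model.classes_)
```

Scoring on a held-out split like this gives a local number to compare against before and after adding features.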

In [42]:
# Break the timestamp down into Year, Month, and Day
X['Dates'] = pd.to_datetime(df['Dates'])
X['Year'] = X.Dates.dt.year
X['Month'] = X.Dates.dt.month
X['Day'] = X.Dates.dt.day
X.drop('Dates', axis=1, inplace=True)
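
The timestamp also carries a time of day, which the Year/Month/Day split doesn't use. Whether it helps is untested here; this is just a sketch of how an hour column could be pulled out with the same `.dt` accessor, on two made-up timestamps.

```python
import pandas as pd

# Invented timestamps in the same format as the Dates column.
stamps = pd.DataFrame({'Dates': ['2015-05-13 23:53:00', '2003-01-06 00:01:00']})
stamps['Dates'] = pd.to_datetime(stamps['Dates'])

# Same pattern as Year/Month/Day, applied to the time component.
stamps['Hour'] = stamps['Dates'].dt.hour
stamps['Minute'] = stamps['Dates'].dt.minute
```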

In [43]:
Y.to_csv('data/Y.csv')
X.to_csv('data/X.csv')


I'll also clean up the test data to make life easier.

In [41]:
test = pd.read_csv('./data/test.csv')
test.drop('Id', axis=1, inplace=True)

# Same timestamp breakdown as the training data
test['Dates'] = pd.to_datetime(test['Dates'])
test['Year'] = test.Dates.dt.year
test['Month'] = test.Dates.dt.month
test['Day'] = test.Dates.dt.day
test.drop('Dates', axis=1, inplace=True)

# Encode the first three columns (DayOfWeek, PdDistrict, Address) as integer codes
for col in test.columns[:3]:
    test[col] = test[col].astype('category')
    test[col] = test[col].cat.rename_categories(range(test[col].nunique()))
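
One thing to watch with this approach: the codes are assigned separately for train and test, so a given address only gets the same integer in both files if they happen to contain exactly the same set of values. Here is a sketch of one way around that, building a shared category list first. The values are toy data and `shared` is a name introduced for illustration.

```python
import pandas as pd

train_addr = pd.Series(['A ST', 'C ST', 'A ST'])
test_addr = pd.Series(['B ST', 'C ST'])

# Build one sorted vocabulary covering both files.
shared = sorted(set(train_addr) | set(test_addr))

# Encoding against the shared list gives each value the same code everywhere.
train_codes = pd.Categorical(train_addr, categories=shared).codes
test_codes = pd.Categorical(test_addr, categories=shared).codes
```

With separate encodings, `C ST` would have been code 1 in both toy series only by coincidence; with the shared list it is code 2 in both by construction.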

In [46]:
test.to_csv('./data/Z.csv')