This notebook contains our solution to the Kaggle SF Crime problem:
https://www.kaggle.com/c/sf-crime

We started with a basic exploration of the data.
- Count unique values for each variable
- View most common values
- XY plot of longitude/latitude, with circle size indicating the number of crimes
- Time series plots to see how the frequency of each category changes over time

Interesting Points:
- Most crime on Friday, then Wednesday. Least on Sunday.
- X (longitude) and Y (latitude) have the same number of distinct values. They seem to be tied to a fixed set of locations
  since, despite having many significant figures, they can still be frequency counted
- 800 Block of BRYANT ST has more than 4x as many data points as any other address. It seems to correspond to the most frequent X and Y values
- "Other Offenses" are common
- The dates with the most crime are New Year's Day, followed by the first day of other months.
- Note: Y has a strange maximum value of 90 for 67 rows. These coordinates are far outside San Francisco (a latitude of 90 is the North Pole), but the rows have addresses in SF. We removed this data from our analysis.
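
Two of the points above (the link between addresses and coordinates, and the out-of-range latitudes) can be checked with a few lines of pandas. The sketch below is only an illustration: it uses a small made-up frame in place of `train.csv`, and the toy rows and the names `coords_per_address`, `bad_rows`, and `cleaned` are assumptions, not part of our original analysis.

```python
import pandas as pd

# Made-up rows standing in for train.csv (coordinates are illustrative only).
toy = pd.DataFrame({
    "Address": ["800 Block of BRYANT ST", "800 Block of BRYANT ST",
                "100 Block of OAK ST", "UNKNOWN"],
    "X": [-122.4034, -122.4034, -122.4214, -120.5],
    "Y": [37.7754, 37.7754, 37.7749, 90.0],
})

# How many distinct (X, Y) pairs does each address map to?
coords_per_address = (toy.drop_duplicates(["Address", "X", "Y"])
                         .groupby("Address").size())
print(coords_per_address)

# Rows whose latitude cannot be in San Francisco.
bad_rows = toy[toy["Y"] >= 90]
print(len(bad_rows))

# Keep only the plausible rows.
cleaned = toy[toy["Y"] < 90]
```

If every address maps to exactly one coordinate pair, that would support the idea that X and Y are assigned per location rather than measured per incident.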


In [1]:
#This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import csv
import datetime

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

In [3]:
data = pd.read_csv("train.csv")
# Count distinct for each variable:
print "There are a total of {:,}, values.".format(len(data))

for var, series in data.iteritems():
    print "There are a total of {:,} {}.".format(len(series.value_counts()), var)
# View all values of Category, PdDistrict, and Resolution
variables = ["Category", "PdDistrict", "Resolution"]
for col in variables:
    print "-------------------------------------------------------------------------"
    print "There are a total of {:,} distinct {} values, as follows: ".format(len(data[col].value_counts()), col)
    print data[col].value_counts()/len(data)
    print

There are a total of 878,049, values.
There are a total of 389,257 Dates.
There are a total of 39 Category.
There are a total of 879 Descript.
There are a total of 7 DayOfWeek.
There are a total of 10 PdDistrict.
There are a total of 17 Resolution.
There are a total of 23,228 Address.
There are a total of 34,243 X.
There are a total of 34,243 Y.
-------------------------------------------------------------------------
There are a total of 39 distinct Category values, as follows: 
LARCENY/THEFT                  0.199192
OTHER OFFENSES                 0.143707
NON-CRIMINAL                   0.105124
ASSAULT                        0.087553
DRUG/NARCOTIC                  0.061467
VEHICLE THEFT                  0.061251
VANDALISM                      0.050937
WARRANTS                       0.048077
BURGLARY                       0.041860
SUSPICIOUS OCC                 0.035777
MISSING PERSON                 0.029599
ROBBERY                        0.026194
FRAUD                          0.01

We did more of this type of analysis but are showing only a sample for this report.

Our first model was a simple KNN model. We tried a variety of values of k (the number of neighbors) on both normalized and raw data. We also tried restricting the model to a subset of crime categories. In the end, we chose k=1, normalized data, and the top 4 crime categories. One problem with KNN was that it predicted hard 1's and 0's, which the competition's log-loss metric (lower is better) penalizes heavily: the resulting score was a poor 27. For the final KNN submission, we replaced the 0's with the average probability of each crime. This submission got a score of 2.92.
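
The competition is scored with multi-class log loss, where lower is better, which is why hard 0/1 predictions are punished so heavily. The sketch below is a minimal NumPy illustration of that effect on toy data, not the code we ran. The function name `multiclass_log_loss`, the toy arrays, the floor value of 0.05, and the clipping constant of 1e-15 are all assumptions made for the example.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (n,)
    probs:  predicted probabilities, shape (n, n_classes)
    eps:    clipping constant (assumed value) that avoids log(0)
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    # Renormalize so each row still sums to 1 after clipping.
    probs = probs / probs.sum(axis=1, keepdims=True)
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(picked))

# Toy example: 4 samples, 3 classes, hard predictions with one mistake.
y_true = np.array([0, 1, 2, 1])
hard = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [1, 0, 0]], dtype=float)  # the last row is wrong

# Replace the zeros with a small floor probability, then renormalize.
soft = np.where(hard == 0, 0.05, hard)
soft = soft / soft.sum(axis=1, keepdims=True)

hard_loss = multiclass_log_loss(y_true, hard)
soft_loss = multiclass_log_loss(y_true, soft)
print("hard: %.3f  soft: %.3f" % (hard_loss, soft_loss))
```

In this toy case the single confident mistake costs about 34.5 on its own, so the hard predictions average roughly 8.6, while the smoothed ones average roughly 0.84.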

In [None]:
data = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
data = data[data.Y != 90]
stk_list = ['LARCENY/THEFT','OTHER OFFENSES','NON-CRIMINAL','ASSAULT']
data = data[data.Category.isin(stk_list)]
data['Dates'] = pd.to_datetime(data['Dates'])
data['Year'] = data.Dates.dt.year
data['Month'] = data.Dates.dt.month
data['Day'] = data.Dates.dt.day
data['Date'] = data.Dates.dt.date
data['Hour'] = data.Dates.dt.hour
data['DayOfYear'] = data.Dates.dt.dayofyear
data['WeekDay'] = data.Dates.dt.weekday
#
test['Dates'] = pd.to_datetime(test['Dates'])
test['Year'] = test.Dates.dt.year
test['Month'] = test.Dates.dt.month
test['Day'] = test.Dates.dt.day
test['Date'] = test.Dates.dt.date
test['Hour'] = test.Dates.dt.hour
test['DayOfYear'] = test.Dates.dt.dayofyear
test['WeekDay'] = test.Dates.dt.weekday
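def add_date_diff(df, origin):
    # Whole days elapsed since a shared origin date, so that train and test
    # are measured on the same scale.
    dates = pd.to_datetime(df['Dates']).dt.normalize()
    df['DateDiff'] = (dates - origin) / np.timedelta64(1, 'D')

# Use the earliest date across both files as the common origin.
origin = min(data['Dates'].min(), test['Dates'].min()).normalize()
add_date_diff(data, origin)
add_date_diff(test, origin)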

In [None]:
# Create random dev sample so we can see how that accuracy compares to our Kaggle results
np.random.seed(100)

rows = np.random.choice(data.index, size=len(data) // 10, replace=False)

dev = data.loc[rows]
train = data.drop(rows)

# Convert to Numpy Format
train_data = np.array(train[['DateDiff','X','Y']].values)
train_labels = np.array(train[['Category']].values.ravel())

dev_data = np.array(dev[['DateDiff','X','Y']].values)
dev_labels = np.array(dev[['Category']].values.ravel())

full_data = np.array(data[['DateDiff','X','Y']].values)
full_labels = np.array(data[['Category']].values.ravel())

test_data = np.array(test[['DateDiff','X','Y']].values)

# Normalize each feature to the range 0-1: a + (x-A)*(b-a)/(B-A) with a=0, b=1.
# All splits are scaled with the training set's min (A) and max (B) so that
# they share one scale. np.abs() makes the negative longitudes positive.
train_min = np.abs(train_data).min(axis=0)
train_range = np.abs(train_data).max(axis=0) - train_min
train_normed = (np.abs(train_data) - train_min) / train_range
dev_normed = (np.abs(dev_data) - train_min) / train_range
test_normed = (np.abs(test_data) - train_min) / train_range
full_normed = (np.abs(full_data) - train_min) / train_range


In [None]:
# Use GridSearchCV to find a good number of neighbors.
ks = {'n_neighbors': [1,2,3,4,5,6,7,8,9,10]}
# The labels are multi-class, so use the weighted-average F1 score.
KNNGridSearch = GridSearchCV(KNeighborsClassifier(), ks, scoring='f1_weighted')
KNNGridSearch.fit(train_normed, train_labels)
# Report the score for each k and the best k.
print "The scores for each k value were %s " % (KNNGridSearch.grid_scores_)
print "The best k value was %s with F1 score %.4f" % (KNNGridSearch.best_params_, KNNGridSearch.best_score_)


In [None]:
def create_submission(preds):
    labels = ["Id",
                "ARSON",
                "ASSAULT",
                "BAD CHECKS",
                "BRIBERY",
                "BURGLARY",
                "DISORDERLY CONDUCT",
                "DRIVING UNDER THE INFLUENCE",
                "DRUG/NARCOTIC",
                "DRUNKENNESS",
                "EMBEZZLEMENT",
                "EXTORTION",
                "FAMILY OFFENSES",
                "FORGERY/COUNTERFEITING",
                "FRAUD",
                "GAMBLING",
                "KIDNAPPING",
                "LARCENY/THEFT",
                "LIQUOR LAWS",
                "LOITERING",
                "MISSING PERSON",
                "NON-CRIMINAL",
                "OTHER OFFENSES",
                "PORNOGRAPHY/OBSCENE MAT",
                "PROSTITUTION",
                "RECOVERED VEHICLE",
                "ROBBERY",
                "RUNAWAY",
                "SECONDARY CODES",
                "SEX OFFENSES FORCIBLE",
                "SEX OFFENSES NON FORCIBLE",
                "STOLEN PROPERTY",
                "SUICIDE",
                "SUSPICIOUS OCC",
                "TREA",
                "TRESPASS",
                "VANDALISM",
                "VEHICLE THEFT",
                "WARRANTS",
                "WEAPON LAWS"
              ]
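    head_str = ','.join(labels)

    # Build an Id column (0, 1, 2, ...) and put it in front of the predictions
    ids = np.arange(preds.shape[0]).reshape(-1, 1)
    results = np.column_stack((ids, preds))

    # Write results to csv: integer Ids, probabilities with 6 decimal places
    # (a plain '%d' format would truncate every probability below 1 to 0)
    fmt = ['%d'] + ['%.6f'] * preds.shape[1]
    np.savetxt('sample.csv', results, fmt=fmt, delimiter=',', header=head_str, comments='')

    return results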

In [None]:
# Now that we've chosen k, fit the KNN on the full training set, predict on the test set, then format.
KNNmodel = KNeighborsClassifier(n_neighbors=1)
KNNmodel.fit(full_normed, full_labels)
test_predict = KNNmodel.predict_proba(test_normed).astype(int)
results = create_submission(test_predict)

We should note that entering the other 35 categories with their average probability was done by hand in Excel. We looked at doing it in Python, but decided to spend the time pursuing other models instead.
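
For reference, the manual step could be reproduced in Python along the following lines. This is a sketch under assumptions, not the procedure we actually ran: the helper name `expand_predictions`, the toy class list, and the toy prior values are hypothetical, and rescaling each row to sum to 1 is just one possible choice.

```python
import numpy as np

def expand_predictions(preds, model_classes, all_classes, priors):
    """Spread predictions for a subset of classes over the full class list.

    preds:         array (n_samples, len(model_classes)) from the model
    model_classes: class names in the column order of `preds`
    all_classes:   every class name expected in the submission, in order
    priors:        dict mapping class name -> average frequency in training
    """
    n = preds.shape[0]
    # Start every column at the class's average probability.
    full = np.tile([priors[c] for c in all_classes], (n, 1)).astype(float)
    for j, name in enumerate(model_classes):
        col = all_classes.index(name)
        # Keep the model's value where it is non-zero, else leave the prior.
        nonzero = preds[:, j] > 0
        full[nonzero, col] = preds[nonzero, j]
    # Rescale rows so each one is a valid probability distribution.
    return full / full.sum(axis=1, keepdims=True)

# Toy example with 3 classes, of which the model only predicts two.
all_classes = ["ASSAULT", "FRAUD", "LARCENY/THEFT"]
priors = {"ASSAULT": 0.3, "FRAUD": 0.1, "LARCENY/THEFT": 0.6}
preds = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
full = expand_predictions(preds, ["ASSAULT", "LARCENY/THEFT"], all_classes, priors)
print(full)
```

With the real data, `model_classes` would come from the fitted classifier's `classes_` attribute and `priors` from the category frequencies printed earlier.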

The second model was logistic regression. We tried several values of C before settling on C=0.001. This improved our score to 2.69.

In [None]:
from sklearn.linear_model import LogisticRegression

# Grid search over the inverse regularization strength C,
# using the same normalized features as the final model.
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
clf.fit(train_normed, train_labels)
print clf.best_params_

# Collect (mean CV score, std of CV scores, C) for each grid point.
res = list(zip(*[(mean_score, cv_scores.std(), p['C'])
                 for p, mean_score, cv_scores in clf.grid_scores_]))
# Top panel: mean score vs. C. Bottom panel: score spread vs. C.
plt.subplot(2,1,1)
plt.plot(res[2],res[0],'-o')
plt.subplot(2,1,2)
plt.plot(res[2],res[1],'-o')
plt.show()

# Fit the chosen model and predict class probabilities for the test set.
model = LogisticRegression(C=0.001)
model.fit(train_normed, train_labels)
test_predict = np.array(model.predict_proba(test_normed))
results = create_submission(test_predict)

After submitting this model, we tried averaging the KNN and LR in various ways, but that did not improve the performance.
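
One simple way to combine two models is a weighted average of their predicted probability matrices. The following sketch illustrates the idea on toy arrays. The function name `blend_probabilities` and the weight value are illustrative assumptions, not the exact combinations we tried, and both inputs are assumed to list the classes in the same column order.

```python
import numpy as np

def blend_probabilities(p_a, p_b, weight_a=0.5):
    """Weighted average of two (n_samples, n_classes) probability arrays."""
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    if p_a.shape != p_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    blended = weight_a * p_a + (1.0 - weight_a) * p_b
    # Rows already sum to 1 if both inputs do; renormalize defensively.
    return blended / blended.sum(axis=1, keepdims=True)

# Toy stand-ins for hard KNN output and softer logistic regression output.
knn_like = np.array([[1.0, 0.0], [0.0, 1.0]])
lr_like = np.array([[0.6, 0.4], [0.3, 0.7]])
blended = blend_probabilities(knn_like, lr_like, weight_a=0.25)
print(blended)
```

A lower `weight_a` keeps the blend closer to the better-calibrated model, which matters under log loss because confident mistakes are costly.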

Our next attempt was BernoulliNB, which scored about the same as LR.

Then we tried GradientBoosting, which improved our score to 2.49.

Finally, we used a neural net, which got our best score of 2.45.