# Crimeless: Static Model

<img src="wilder_tempdata/243856847_dfd7b10570_b.jpg">


Is crime random? Or is there an underlying pattern to be discerned? This is the question that we set out to answer by comparing a variety of datasets to a comprehensive list of geo-tagged crime scores.

These crime scores were created as described in the "Static Data" ipython notebook. The static datasets analyzed include liquor licenses and property values. 

The models we use include: 
* SVM
* Logistic Regression
* Linear Regression
* Naive Bayes

In [1]:
import numpy as np
import time
import csv
import pickle

First we import and open the feature and danger dictionaries created in the static data notebook. 

In [2]:
featureDictInput = open('featureDict.pkl', 'rb')
featureDict = pickle.load(featureDictInput)

In [3]:
dangerDictInput = open('dangerDict.pkl', 'rb')
dangerDict = pickle.load(dangerDictInput)

Here we proceed to find and delete the two geotagged locations that occur in danger dictionary but not in the feature dictionary.

In [4]:
missingKeys = []
for key in dangerDict:
    if key not in featureDict:
        missingKeys.append(key)

print missingKeys
        
for k in missingKeys:
    del dangerDict[k]
    

[(42.32, -70.96), (42.31, -70.97)]


Next, we split the features and the danger scores into their respective X and Y arrays for the purposes of executing the models. We keep track of any problems with a count.

In [5]:
X = []
Y = []
count = 0
for key in dangerDict.keys():
    try:
        X.append(featureDict[key])
        Y.append(dangerDict[key])
    except:
        count += 1
print count

0


We split into training and test sets. Approximately 70% of the data is used for the training and 30% is used for the test. 

In [6]:
X_train = X[:120]
Y_train = Y[:120]
X_test = X[120:]
Y_test = Y[120:]

We also write a function to keep track of our error rate from each respective model's predictions. The error rate is the average difference between our predicted danger scores and the actual danger scores from the test set.

We chose to use this metric for determining the efficacy of our models because of the nature of the danger scores we are calculating. These scores are from 0-10, so an accuracy score of when we exactly match the correct value is less valuable here. Instead, it is helpful to know when we are closest to the actual danger scores.

In [41]:
def getErrorRate(Preds, Labels):
    errors = []
    for i in xrange(len(Preds)): 
        errors.append(np.absolute(float(Labels[i]) - float(Preds[i])))
    return float(sum(errors))/len(errors)

We can run this code in case we wish to switch between continuous data (with a decimal) and labelled data (integers).  

In [11]:
Y_train= map(lambda x: round(x, 0), Y_train)
Y_test = map(lambda x: round(x, 0), Y_test)
Y = map(lambda x: round(x, 0), Y)

## SVM

We set up a classifier using SVC. We then fit the classifier SVM on the training data. 

**DO WE WANT TO DO THE BEST PARAMETER THING?

In [25]:
from sklearn.svm import SVC
clfsvm = SVC()
Cs=[0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

from sklearn.grid_search import GridSearchCV

parameters = {"C": Cs}
fitmodel = GridSearchCV(clfsvm, param_grid = parameters, cv = 5)
fitmodel.fit(X_train, Y_train)
print "The best parameter is:", fitmodel.best_params_
best = fitmodel.best_estimator_
best.fit(X_train,Y_train)
SCMpredictions = best.predict(X_test)
getErrorRate(SCMpredictions, Y_test)

The best parameter is: {'C': 0.001}


144.0

In [36]:
clfsvm.fit(X_train, Y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [37]:
SVMPredictions = clfsvm.predict(X_test)
# clfsvm.score(X_test, Y_test)

In [38]:
getErrorRate(SVMPredictions, Y_test)

2.6481481481481484

In [None]:
# C = [0.1, 1.0, 10.0]
# kernel = ['linear', 'poly', 'rbf', 'sigmoid']
# best_error = 9999
# best_C = None
# best_kern = None
# best_SVM = None

# for c in C:
#     for kern in kernel:
#         optimalclfsvm = SVC(C=c, kernel=kern).fit(X_train, Y_train)
#         optimalSVMPReds = optimalclfsvm.predict(X_test)
#         errorRate = getErrorRate(optimalSVMPReds, Y_test)
#         if errorRate < best_error:
#             best_error = errorRate
#             best_C = c
#             best_kern = kern
#             best_SVM = optimalclfsv
#         print kern
#     print c

In [None]:
# best_SVM_danger_Predictions = best_SVM.predict(X)

In [None]:
# getErrorRate(best_SVM_danger_Predictions,Y_test)

## Logistic Regression

Next, we proceed to fit a logistic regression in the same manner as we did with the SVM.

In [12]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression()
logReg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [40]:
logRegPreds = logReg.predict(X_test)

In [39]:
getErrorRate(logRegPreds, Y)

2.925925925925926

In [26]:
def cv_optimize(clf, parameters, X, y, n_folds, score_func = None):
    fitmodel = GridSearchCV(clf, param_grid = parameters, cv = n_folds, scoring = score_func)
    fitmodel.fit(X, y)
    return fitmodel.best_estimator_

In [27]:
reuse_split=dict(Xtrain=X_train, Xtest=X_test, ytrain=Y_train, ytest=Y_test)

In [28]:
from sklearn.metrics import confusion_matrix
def do_classify(clf, parameters, X,Y, featurenames, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5):
    X=X
    y=Y
    if mask !=None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split !=None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data:     %0.2f" % (test_accuracy)
    print confusion_matrix(ytest, clf.predict(Xtest))
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

In [29]:
from sklearn.linear_model import LogisticRegression
clflog = LogisticRegression(penalty = "l1")
clflog, Xtrain, ytrain, Xtest, ytest = do_classify(clflog, {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}, X_train, Y_train, u'RESP',1, reuse_split = reuse_split)

using reuse split
############# based on standard predict ################
Accuracy on training data: 0.19
Accuracy on test data:     0.07
[[0 0 0 1 1 0 0 0 0 0 0]
 [0 0 0 1 3 0 0 2 0 0 0]
 [0 1 0 1 1 0 0 2 0 0 0]
 [0 1 0 2 2 0 0 0 0 0 0]
 [0 0 0 0 2 0 0 1 0 0 0]
 [0 0 0 1 3 0 0 5 0 0 0]
 [0 0 0 0 0 0 0 5 0 0 0]
 [0 0 0 1 1 0 0 0 0 1 0]
 [0 0 0 1 3 0 0 2 0 1 0]
 [0 0 0 1 4 0 0 2 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 0]]
########################################################


In [30]:
clflog.predict(Xtest)

array([ 7.,  1.,  3.,  3.,  7.,  4.,  7.,  7.,  4.,  4.,  4.,  9.,  7.,
        7.,  3.,  7.,  4.,  4.,  4.,  4.,  4.,  3.,  3.,  3.,  4.,  4.,
        4.,  3.,  7.,  7.,  7.,  9.,  7.,  4.,  4.,  4.,  7.,  9.,  4.,
        3.,  7.,  7.,  4.,  7.,  1.,  7.,  4.,  4.,  4.,  3.,  7.,  7.,
        7.,  4.])

In [55]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:#for stuff like logistic regression
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:#for stuff like SVM
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    if labe!=None:
        for k in xrange(0, fpr.shape[0],labe):
            #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
            threshold = str(np.round(thresholds[k], 2))
            ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    return ax

In [57]:
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
with sns.color_palette("dark"):
    ax=make_roc("svm-all-features",clflog, Y_test, X_test, None, labe=250, proba=False, skip=50)

ValueError: bad input shape (54, 11)

## Linear Regression

Next, we use a linear regression to 

In [42]:
propValues = map(lambda x: x[0], X)
liquorValues = map(lambda x: x[1], X)
print "the correlation between property values and danger score is : ", np.corrcoef(propValues, Y)
print "the correlation between liquor licenses and danger score is : ", np.corrcoef(liquorValues, Y)

the correlation between property values and danger score is :  [[ 1.        -0.2281024]
 [-0.2281024  1.       ]]
the correlation between liquor licenses and danger score is :  [[ 1.         0.3713725]
 [ 0.3713725  1.       ]]


In [43]:
from sklearn.linear_model import LinearRegression
linReg = LinearRegression()
linReg.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [44]:
linRegPreds = linReg.predict(X_test)
linRegPreds = map(lambda x: max(0, x), linRegPreds)

In [45]:
linRegPreds[0:5]

[4.9378471700469566,
 0,
 4.7145269439540733,
 4.7656760032076972,
 4.929318870159813]

In [46]:
getErrorRate(linRegPreds, Y_test)

2.1549054761737523

## Naive Bayes

Finally, we use the Naive Bayes classification algorithm. 

In [47]:
Y_train_labels = map(lambda x: round(x, 0), Y_train)
Y_test_labels = map(lambda x: round(x, 0), Y_test)
Y_labels = map(lambda x: round(x, 0), Y)

In [48]:
from sklearn.naive_bayes import GaussianNB
naiveBayes = GaussianNB()
naiveBayes.fit(X_train, Y_train_labels)

GaussianNB()

In [49]:
NBPreds = naiveBayes.predict(X_test)

In [50]:
getErrorRate(NBPreds, Y_labels)

2.740740740740741