# Goal: build a model that predicts whether a weapon was used in a crime, based on the attributes of that crime

The outcome is binary: true or false. I will try a Naive Bayes model, logistic regression, a decision tree, a random forest, and an SVM. I will run each one with little or no tuning, see which one does best, and then give that one more attention.

In [1]:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import functions as F
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report



%matplotlib inline

In [2]:
df = pd.read_csv('clean_crime_data.csv')

I am not sure how many crimes involved the use of a weapon. Let's look:

In [3]:
len(df[df.firearm_used_flag >= 1])  # this feels like a cumbersome approach. Let's do something cool

401

In [4]:
spark = pyspark.sql.SparkSession.builder.appName('pandasToSparkDF').getOrCreate()

# create spark dataframes
crime_df = spark.createDataFrame(df)

crime_df.createOrReplaceTempView('crime')

print('crimes where weapon was used')

gun_crimes = spark.sql("""
select 
    count(distinct crime_id) as crime_count,
    description
from crime
where firearm_used_flag >= 1
and description not LIKE '%Weapons%'
group by 2 order by 1 desc
""")

gun_crimes.show()

print('all crimes')

all_crimes = spark.sql("""
select
count(distinct crime_id) as crime_count
from crime
""")

all_crimes.show()

#print('Weapons were used in {}% of the crimes in this data set')

crimes where weapon was used
+-----------+--------------------+
|crime_count|         description|
+-----------+--------------------+
|         93|  Aggravated Assault|
|         55|Aggravated Assaul...|
|          8|Non Aggravated As...|
|          6|       Armed Robbery|
|          5|Non Aggravated As...|
|          2|  Strong Arm Robbery|
|          1|                Rape|
|          1|            Homocide|
|          1|Kidnapping/Abduction|
+-----------+--------------------+

all crimes
+-----------+
|crime_count|
+-----------+
|      30400|
+-----------+



In [5]:
print('Weapons were used in {}% of the crimes in this data set'.format(round((gun_crimes.groupBy().sum().collect()[0][0]/
                                                                            all_crimes.groupBy().sum().collect()[0][0]),3)*100))

Weapons were used in 0.6% of the crimes in this data set


OK. This might look pretty bad, but there are a lot of crime types we can exclude to narrow our focus and give this percentage more of a fighting chance!

In [6]:
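gc = gun_crimes.toPandas()
# drop row 6 of the table above; this relies on the displayed row order
gc.drop([6], axis=0, inplace=True)

# keep only crimes whose description appears among the remaining firearm-related descriptions
data = df[df.description.isin(gc.description.unique())]
data.reset_index(inplace=True, drop=True)
# keep=False drops every row whose crime_id appears more than once
data = data.drop_duplicates(subset=['crime_id'], keep=False).copy()
# collapse both flags to 0/1
data.firearm_used_flag = np.where(data.firearm_used_flag >= 1, 1, 0)
data.dvflag = np.where(data.dvflag >= 1, 1, 0)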

In [7]:
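# note: the denominator is the non-firearm count, so this is firearm crimes per 100
# non-firearm crimes, slightly higher than the firearm share of all crimes
print('now we have {}% of the crimes in this data set involving a firearm'.format(round((len(data[data.firearm_used_flag>=1])/
                                                                                 len(data[data.firearm_used_flag<1]))*100,2)))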

now we have 2.37% of the crimes in this data set involving a firearm


This will be much better, although at roughly 2% the positive class is still heavily imbalanced.
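One detail worth checking in the percentage printed above: the code divides the firearm count by the non-firearm count, which gives a ratio rather than a share of all crimes. A minimal sketch of the difference follows. `ratio_and_share` is a hypothetical helper, and the counts 20 and 980 are made up for illustration, not taken from this data set.

```python
def ratio_and_share(n_positive, n_negative):
    """Return (positives per 100 negatives, positives as % of all rows)."""
    ratio = n_positive / n_negative * 100
    share = n_positive / (n_positive + n_negative) * 100
    return round(ratio, 2), round(share, 2)

# hypothetical counts, for illustration only
print(ratio_and_share(20, 980))
```

With a rare event the two numbers are close, so the conclusion does not change, but the share of all rows is the one that matches the prevalence reported later.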

# Naive Bayes Classifier

I will be using the Complement Naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. Given that we are trying to predict an event that only occurs ~2% of the time, this is a good choice.

https://www.youtube.com/watch?v=CPqOCI0ahss

This video is a really good high-level explanation of how a Naive Bayes model works. It's really pretty simple.
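To make the "complement" idea concrete, here is a simplified sketch of how CNB can score classes: for each class it estimates feature frequencies from all rows that do *not* belong to that class, then prefers the class whose complement fits the row worst. This is an illustration under simplifying assumptions (no weight normalization, made-up toy counts), not scikit-learn's implementation; `fit_complement_nb` and `predict_complement_nb` are hypothetical names.

```python
import math

def fit_complement_nb(X, y, alpha=1.0):
    """Per class, return -log of the smoothed feature frequencies in the complement of that class."""
    n_features = len(X[0])
    weights = {}
    for c in sorted(set(y)):
        # start from the smoothing term, then add counts from every row NOT in class c
        comp = [alpha] * n_features
        for row, label in zip(X, y):
            if label != c:
                for j, value in enumerate(row):
                    comp[j] += value
        total = sum(comp)
        weights[c] = [-math.log(v / total) for v in comp]
    return weights

def predict_complement_nb(weights, row):
    """Pick the class with the highest score, i.e. whose complement explains the row worst."""
    scores = {c: sum(v * w for v, w in zip(row, ws)) for c, ws in weights.items()}
    return max(scores, key=scores.get)

# toy count data (made up): four rows of class 0, one row of class 1
X = [[3, 0], [2, 1], [4, 0], [3, 1], [0, 2]]
y = [0, 0, 0, 0, 1]
w = fit_complement_nb(X, y)
print(predict_complement_nb(w, [0, 3]), predict_complement_nb(w, [5, 0]))
```

Because the statistics for the rare class come from the many rows of the other class, the estimates are less sensitive to how few positive examples there are.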

In [8]:
from sklearn.naive_bayes import ComplementNB

# split the data, will use this same data for other models 
model_df = data.drop(columns=['crime_id','from_date','charge_id'])

description = pd.get_dummies(model_df['description'])
zipcode = pd.get_dummies(model_df['zip_code'])

model_df_2 = pd.concat([model_df,description,zipcode], axis = 1)

model_df_data = model_df_2.drop(columns=['firearm_used_flag','description','zip_code'])
X_train, X_test, y_train, y_test = train_test_split(model_df_data,model_df_2['firearm_used_flag'],test_size = .3,
                                                    random_state = 42) # changing from .15
# train the model
model = ComplementNB().fit(X_train, y_train)
predicted = model.predict(X_test)

# put the results in a confusion matrix
nb_results = pd.DataFrame(confusion_matrix(y_test, predicted), columns=['pred_no_gun','pred_gun'],
             index = ['actual_no_gun','actual_gun'])
nb_results


               pred_no_gun  pred_gun
actual_no_gun         1842       317
actual_gun               8        48


###### OK, this model feels alright. Let's break it down some:

In [9]:
def modelStats(results):
    """Print summary rates (in %) for a 2x2 confusion matrix (rows = actual, columns = predicted)."""
    total = results.values.sum()
    accuracy = ((results.loc['actual_no_gun','pred_no_gun'] + results.loc['actual_gun','pred_gun'])/total)*100
    mis_class = ((results.loc['actual_gun','pred_no_gun'] + results.loc['actual_no_gun','pred_gun'])/total)*100
    true_pos = (results.loc['actual_gun','pred_gun']/results.loc['actual_gun'].sum())*100
    false_pos = (results.loc['actual_no_gun','pred_gun']/results.loc['actual_no_gun'].sum())*100
    # note: dividing by all predicted negatives makes this the negative predictive value,
    # not the true negative rate (which would divide by results.loc['actual_no_gun'].sum())
    true_neg = (results.loc['actual_no_gun','pred_no_gun']/results.pred_no_gun.sum())*100
    precision = (results.loc['actual_gun','pred_gun']/results.pred_gun.sum())*100
    prevalence = (results.loc['actual_gun'].sum()/total)*100

    # the function only prints; it returns None
    print('The model was {}% accuracte'.format(round(accuracy,2)))
    print('The model had a misclassification rate of {}%'.format(round(mis_class,2)))
    print('The model had a true positive rate of {}%'.format(round(true_pos,2)))
    print('The model had a false positive rate of {}%'.format(round(false_pos,2)))
    print('The model had a true negitive rate of {}%'.format(round(true_neg,2)))
    print('The model had a precision rate of {}%'.format(round(precision,2)))
    print('The model had a prevalence rate of {}%'.format(round(prevalence,2)))
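A note on the `true_neg` figure in the function above: it divides by the number of predicted negatives, which is usually called the negative predictive value, while the true negative rate (specificity) divides by the number of actual negatives. The sketch below computes both from raw counts. `negative_rates` is a hypothetical helper, and the counts are read off the Naive Bayes confusion matrix above.

```python
def negative_rates(tn, fp, fn):
    """Return (true negative rate, negative predictive value) in percent."""
    tnr = tn / (tn + fp) * 100   # share of actual negatives that were predicted negative
    npv = tn / (tn + fn) * 100   # share of predicted negatives that are truly negative
    return round(tnr, 2), round(npv, 2)

# counts from the Naive Bayes confusion matrix: TN=1842, FP=317, FN=8
print(negative_rates(1842, 317, 8))
```

On imbalanced data the two can differ a lot, so it helps to be explicit about which one is being reported.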

In [10]:
print(classification_report(y_test,predicted))
modelStats(nb_results)

              precision    recall  f1-score   support

           0       1.00      0.85      0.92      2159
           1       0.13      0.86      0.23        56

   micro avg       0.85      0.85      0.85      2215
   macro avg       0.56      0.86      0.57      2215
weighted avg       0.97      0.85      0.90      2215

The model was 85.33% accuracte
The model had a misclassification rate of 14.67%
The model had a true positive rate of 85.71%
The model had a false positive rate of 14.68%
The model had a true negitive rate of 99.57%
The model had a precision rate of 13.15%
The model had a prevalence rate of 2.53%


# Logistic Regression 
Logistic regression uses the logistic (sigmoid) function, an S-shaped curve whose output always lies between 0 and 1. Its formula is $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $e$ is Euler's number (2.71...) and $z$ is a linear regression line ($y = mx + b$...) for the data.
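As a quick illustration of the curve described above, here is a minimal sketch of the logistic function 1/(1 + e^(-z)) evaluated at a few values of z. The function name `sigmoid` and the sample inputs are my own choices for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# large negative z gives values near 0, z = 0 gives 0.5, large positive z gives values near 1
for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3))
```

The classifier then predicts "gun" when this value crosses a threshold, 0.5 by default in scikit-learn's `predict`.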

In [11]:
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression(solver= 'liblinear')

# using same data from the previous split
logmodel.fit(X_train, y_train)

log_pred = logmodel.predict(X_test)

log_results = pd.DataFrame(confusion_matrix(y_test, log_pred), columns=['pred_no_gun','pred_gun'],
             index = ['actual_no_gun','actual_gun'])
log_results

               pred_no_gun  pred_gun
actual_no_gun         2159         0
actual_gun              54         2


In [12]:
print(classification_report(y_test,log_pred))
modelStats(log_results)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      2159
           1       1.00      0.04      0.07        56

   micro avg       0.98      0.98      0.98      2215
   macro avg       0.99      0.52      0.53      2215
weighted avg       0.98      0.98      0.96      2215

The model was 97.56% accuracte
The model had a misclassification rate of 2.44%
The model had a true positive rate of 3.57%
The model had a false positive rate of 0.0%
The model had a true negitive rate of 97.56%
The model had a precision rate of 100.0%
The model had a prevalence rate of 2.53%


This model is pretty bad out of the box. The accuracy looks high, but always guessing "no gun" would be right for 2159 of the 2215 test rows, which is more or less what happened here: the model flagged only 2 of the 56 firearm crimes. Let's see if we can improve it before giving up completely.
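To put a number on the "just guess no gun" baseline, the sketch below computes the accuracy of always predicting the larger class, using the test-set support from the classification report above. `majority_baseline_accuracy` is a hypothetical helper name.

```python
def majority_baseline_accuracy(n_negative, n_positive):
    """Accuracy (in %) of always predicting the larger class."""
    return round(max(n_negative, n_positive) / (n_negative + n_positive) * 100, 2)

# test-set support from the classification report: 2159 no-gun rows, 56 gun rows
print(majority_baseline_accuracy(2159, 56))
```

Any model whose accuracy is not clearly above this figure has not learned much about the positive class, which is why recall and precision on class 1 are the more useful numbers here.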

In [13]:
from sklearn.feature_selection import RFE

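# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(logmodel, n_features_to_select=15)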

rfe.fit(X_train, y_train.values.ravel())
print(rfe.support_)
print(rfe.ranking_)

[False False False False False False False False False False False False
 False False False False  True False False  True False False  True False
  True  True False  True  True  True False False False False False False
 False False False False False False False  True  True False  True  True
 False  True False False False False False False False False False False
 False False False False False False False  True False False False False
 False False False False False False False False False False False  True
 False False False False False False False False False False False False
 False False False False False False False False False False]
[11 31 37  3 52 88 16  2 18 39 84 36  5 19 45 56  1 61 46  1 42 41  1 23
  1  1 65  1  1  1 75 76 87 89 66 71 91 79 49 73 64 69 60  1  1 47  1  1
 24  1 17 15 44 55 14  8 25 57 20 67 63 34 32 10  6 21 50  1 22 27 62 33
 26 78 43 35 12 48 74 28 51 53 81  1 40 38  4 29 13  7 30  9 59 58 86 85
 70 90 82 83 80 77 68 72 92 54]


In [14]:
# pair each feature name with its RFE selection flag and keep only the selected features
feat = pd.DataFrame({'name': X_train.columns, 'tf': rfe.support_})
feat = feat[feat.tf]


In [15]:
cols = feat.name.unique()
X_train_rfe=X_train[cols]
X_test_rfe = X_test[cols]


logmodel.fit(X_train_rfe, y_train)

log_pred = logmodel.predict(X_test_rfe)

log_results = pd.DataFrame(confusion_matrix(y_test, log_pred), columns=['pred_no_gun','pred_gun'],
             index = ['actual_no_gun','actual_gun'])
log_results

               pred_no_gun  pred_gun
actual_no_gun         2159         0
actual_gun              53         3


Nope, still a bad model: feature selection caught only one more firearm crime (3 of 56 instead of 2).

# Decision Trees 

In [16]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

dtree_pred = dtree.predict(X_test)

In [17]:
dtree_results = pd.DataFrame(confusion_matrix(y_test, dtree_pred), columns=['pred_no_gun','pred_gun'],
             index = ['actual_no_gun','actual_gun'])
print(classification_report(y_test,dtree_pred))
modelStats(dtree_results)

display(dtree_results)

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      2159
           1       0.16      0.12      0.14        56

   micro avg       0.96      0.96      0.96      2215
   macro avg       0.57      0.55      0.56      2215
weighted avg       0.96      0.96      0.96      2215

The model was 96.16% accuracte
The model had a misclassification rate of 3.84%
The model had a true positive rate of 12.5%
The model had a false positive rate of 1.67%
The model had a true negitive rate of 97.74%
The model had a precision rate of 16.28%
The model had a prevalence rate of 2.53%


               pred_no_gun  pred_gun
actual_no_gun         2123        36
actual_gun              49         7


This model isn't that bad by its stats, but it found only 7 of the 56 firearm crimes, so again I feel like it's basically just guessing no every time.

# Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier

randf = RandomForestClassifier(n_estimators=150)
randf.fit(X_train, y_train)

randf_pred = randf.predict(X_test)

In [19]:
randf_results = pd.DataFrame(confusion_matrix(y_test, randf_pred), columns=['pred_no_gun','pred_gun'],
             index = ['actual_no_gun','actual_gun'])
print(classification_report(y_test,randf_pred))
modelStats(randf_results)

display(randf_results)

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      2159
           1       0.20      0.04      0.06        56

   micro avg       0.97      0.97      0.97      2215
   macro avg       0.59      0.52      0.52      2215
weighted avg       0.96      0.97      0.96      2215

The model was 97.2% accuracte
The model had a misclassification rate of 2.8%
The model had a true positive rate of 3.57%
The model had a false positive rate of 0.37%
The model had a true negitive rate of 97.55%
The model had a precision rate of 20.0%
The model had a prevalence rate of 2.53%


               pred_no_gun  pred_gun
actual_no_gun         2151         8
actual_gun              54         2


# SVM

In [21]:
from sklearn.svm import SVC

# max_iter=25 caps the solver at 25 iterations, so it will likely stop before converging
clf = SVC(gamma='scale', max_iter=25)
clf.fit(X_train, y_train)

clf_pred = clf.predict(X_test)



In [22]:
clf_results = pd.DataFrame(confusion_matrix(y_test, clf_pred), columns=['pred_no_gun','pred_gun'],
             index = ['actual_no_gun','actual_gun'])
print(classification_report(y_test,clf_pred))
modelStats(clf_results)

display(clf_results)

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      2159
           1       0.08      0.20      0.12        56

   micro avg       0.93      0.93      0.93      2215
   macro avg       0.53      0.57      0.54      2215
weighted avg       0.96      0.93      0.94      2215

The model was 92.55% accuracte
The model had a misclassification rate of 7.45%
The model had a true positive rate of 19.64%
The model had a false positive rate of 5.56%
The model had a true negitive rate of 97.84%
The model had a precision rate of 8.4%
The model had a prevalence rate of 2.53%


               pred_no_gun  pred_gun
actual_no_gun         2039       120
actual_gun              45        11


#### Conclusion
None of these models is very good out of the box. I think that is partly because of the small number of positive events we are trying to predict. The alternative explanation is that firearm use simply cannot be predicted well from these attributes. I will go with the Decision Tree model and do some further work on it to see if I can get a better result.
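Since accuracy is dominated by the majority class here, one way to line the models up side by side is by recall and precision on the firearm class. The sketch below recomputes both from the confusion matrices reported above (for logistic regression, the first fit before feature selection). The dictionary keys and the helper `recall_precision` are names I made up for the example.

```python
# (true positives, false negatives, false positives) read off each confusion matrix above
models = {
    'naive_bayes': (48, 8, 317),
    'logistic_regression': (2, 54, 0),
    'decision_tree': (7, 49, 36),
    'random_forest': (2, 54, 8),
    'svm': (11, 45, 120),
}

def recall_precision(tp, fn, fp):
    """Return (recall, precision) in percent for the positive class."""
    recall = tp / (tp + fn) * 100
    # precision is undefined when nothing was predicted positive
    precision = tp / (tp + fp) * 100 if (tp + fp) else float('nan')
    return round(recall, 2), round(precision, 2)

summary = {name: recall_precision(*counts) for name, counts in models.items()}

# list the models from highest to lowest recall
for name, (recall, precision) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f'{name:20s} recall={recall:6.2f}%  precision={precision:6.2f}%')
```

Which trade-off matters more depends on the cost of a missed firearm crime versus a false alarm, so this table is a starting point for the follow-up work rather than a verdict.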