# Assault detection

Using police crime reports, we're going to see whether we can detect the severity of an assault from its written description. It's part of a larger analysis [done by the LA Times](https://www.latimes.com/la-me-g-lapd-reclass-htmlstory.html).

## Imports

First we'll set some display options so everything shows up correctly. The assault descriptions can be quite long, and by default pandas truncates each cell to 50 characters.

See also: https://investigate.ai/latimes-crime-classification/using-a-classifier-to-find-misclassified-crimes/

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)
pd.set_option('display.min_rows', 20)

## Reading in our data

Our dataset is going to be a database of crimes committed between 2008 and 2012. The data has been cleaned and filtered a bit, though, so we're only left with two columns:

* `CCDESC`, what criminal code was violated
* `DO_NARRATIVE`, a short text description of what happened

We're going to use this description to see if we can separate serious cases of assault from non-serious ones.

In [2]:
df = pd.read_csv("2008-2012.csv")
df.head(10)

Unnamed: 0,CCDESC,DO_NARRATIVE
0,SHOPLIFTING - PETTY THEFT ($950 & UNDER),DO-SUSP WAS SEEN THROUGH SURVAILANCE CONCEALING SEVERAL ITEMS INTO HER SHOPPING AND PERSONAL BAG LEAVING WITHOUT PAYING DEPT STORE
1,VIOLATION OF COURT ORDER,DO-SUSP ARRIVED AT VICTS RESID AND ENTERED VICTS RESID IN VIOLATION OF RESTRAINING ORDER
2,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES
3,THEFT PLAIN - PETTY ($950 & UNDER),DO-UNK SUSP TOOK VICT PREPAID GIFT CARD SUSP PURCHASED PRODUCTS WITH ITEM
4,BATTERY - SIMPLE ASSAULT,DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS
5,THEFT OF IDENTITY,DO-UNK SUSP USED VICTS PERSONAL INFO FOR GAIN WITHOUT THE VICTS CONSENT ORKNOWLEDGE
6,SHOPLIFTING - PETTY THEFT ($950 & UNDER),DO-SUSP ENTERED MKT AND SEL ITEMS SUSP CONCEALED ITEMS AND EXITED STORE WOPAYING
7,BURGLARY,DO-UNK SUSP ENTERED VICTS RESIDENCE BY UNLOCKED FRONT DOOR SUSP REMOVED VCTICTS PROPERTY SUSP FLED LOC
8,OTHER MISCELLANEOUS CRIME,DO-SUSP ADMITTED TO PLACING 2010 REG TAG HE ILLEGALLY OBTAINED ON HIS LIC PLATE HIS VEH REG WAS STILL EXP
9,BATTERY - SIMPLE ASSAULT,DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830218 entries, 0 to 830217
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   CCDESC        830218 non-null  object
 1   DO_NARRATIVE  830218 non-null  object
dtypes: object(2)
memory usage: 12.7+ MB


## Filtering for assaults

First, filter the dataset so it **only includes assaults.** No burglary, no identity theft, no shoplifting: just every kind of assault.

In [4]:
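# .copy() gives us an independent DataFrame, so adding columns later
# doesn't trigger a SettingWithCopyWarning
df = df[df.CCDESC.str.contains("ASSAULT")].copy()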
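The same substring filter can be tried on a tiny made-up table first. This is a minimal sketch, assuming a few invented rows (they are not from `2008-2012.csv`) and including one missing `CCDESC` to show what `na=False` does.

```python
import pandas as pd

# Toy stand-in for the crime table (invented rows, not from the real file)
toy = pd.DataFrame({
    "CCDESC": ["BATTERY - SIMPLE ASSAULT", "BURGLARY", None, "OTHER ASSAULT"],
    "DO_NARRATIVE": ["DO-S PUSHED V", "DO-S ENTERED RESID", "DO-UNK", "DO-S SPAT ON V"],
})

# na=False treats a missing description as "no match" instead of propagating NaN.
# regex=False does a plain substring test rather than a regular expression search.
mask = toy["CCDESC"].str.contains("ASSAULT", na=False, regex=False)
assaults = toy[mask].copy()

print(assaults["CCDESC"].tolist())
```

The real dataset has no missing values in `CCDESC`, so `na=False` is only a safeguard for other files.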

In [5]:
df.head()

Unnamed: 0,CCDESC,DO_NARRATIVE
2,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES
4,BATTERY - SIMPLE ASSAULT,DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS
9,BATTERY - SIMPLE ASSAULT,DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V
11,BATTERY - SIMPLE ASSAULT,DO-V STATED THAT SUSP CONFRT HER WHEN SHE TRIED TO APPR HER HUSBAND SUSP AND V HUSBAND ARE FRNDS SUSP YELLED STAY AWAY FROM HIM AND PUSHED V
16,BATTERY - SIMPLE ASSAULT,DO-SUSPS WERE VERBALLY ABUSING VICT DURING WHICH TIME S1 STRUCK VICT THREETIMES ON THE BACK OF HIS LEFT SHOULDER


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165965 entries, 2 to 830209
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   CCDESC        165965 non-null  object
 1   DO_NARRATIVE  165965 non-null  object
dtypes: object(2)
memory usage: 3.8+ MB


In [7]:
# What are our possible crime descriptions?

df.CCDESC.unique()

array(['ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT',
       'BATTERY - SIMPLE ASSAULT', 'INTIMATE PARTNER - SIMPLE ASSAULT',
       'CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT',
       'CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT',
       'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER',
       'INTIMATE PARTNER - AGGRAVATED ASSAULT', 'OTHER ASSAULT'],
      dtype=object)

In [8]:
pd.DataFrame(df['CCDESC'].value_counts())

Unnamed: 0,CCDESC
BATTERY - SIMPLE ASSAULT,71951
"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",43385
INTIMATE PARTNER - SIMPLE ASSAULT,42102
CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT,4297
INTIMATE PARTNER - AGGRAVATED ASSAULT,1606
CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT,1481
ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER,749
OTHER ASSAULT,394


## Converting to a yes/no question

There are a handful of kinds of assault listed in the dataset, but they boil down to either being "aggravated" or "simple". Aggravated assault is treated much more severely than simple assault.

|Description in dataset|Is it serious/aggravated?|
|---|---|
|BATTERY - SIMPLE ASSAULT|no|
|ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT|yes|
|INTIMATE PARTNER - SIMPLE ASSAULT|no|
|CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT|no|
|INTIMATE PARTNER - AGGRAVATED ASSAULT|yes|
|CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT|yes|
|ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER|yes|
|OTHER ASSAULT|no|



In [9]:
# Make a new `0`/`1` integer column that represents
# whether the assault was aggravated/serious or not.

df['is_aggravated'] = df.CCDESC.replace({
    'BATTERY - SIMPLE ASSAULT': 0,
    'ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT': 1,
    'INTIMATE PARTNER - SIMPLE ASSAULT': 0,
    'CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT': 0,
    'INTIMATE PARTNER - AGGRAVATED ASSAULT': 1,
    'CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT': 1,
    'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER': 1,
    'OTHER ASSAULT': 0,
})
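One thing to watch with `.replace` is that any description missing from the dictionary is left in place as a string. As an alternative sketch, `.map` turns unmapped values into `NaN`, which makes them easy to spot. The `SEVERITY` name and the `"SOME NEW CODE"` value below are made up for illustration.

```python
import pandas as pd

# Same yes/no mapping as the table above, written as a dict (name is hypothetical)
SEVERITY = {
    "BATTERY - SIMPLE ASSAULT": 0,
    "ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT": 1,
    "INTIMATE PARTNER - SIMPLE ASSAULT": 0,
    "CHILD ABUSE (PHYSICAL) - SIMPLE ASSAULT": 0,
    "INTIMATE PARTNER - AGGRAVATED ASSAULT": 1,
    "CHILD ABUSE (PHYSICAL) - AGGRAVATED ASSAULT": 1,
    "ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER": 1,
    "OTHER ASSAULT": 0,
}

# "SOME NEW CODE" is an invented description that is not in the mapping
codes = pd.Series(["OTHER ASSAULT", "INTIMATE PARTNER - AGGRAVATED ASSAULT", "SOME NEW CODE"])

labels = codes.map(SEVERITY)             # unmapped descriptions become NaN
unmapped = codes[labels.isna()].unique()  # list what still needs a decision

print(labels.tolist())
print(list(unmapped))
```

With this dataset every description is covered, so both approaches give the same column.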

In [10]:
df.head(5)

Unnamed: 0,CCDESC,DO_NARRATIVE,is_aggravated
2,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES,1
4,BATTERY - SIMPLE ASSAULT,DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS,0
9,BATTERY - SIMPLE ASSAULT,DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V,0
11,BATTERY - SIMPLE ASSAULT,DO-V STATED THAT SUSP CONFRT HER WHEN SHE TRIED TO APPR HER HUSBAND SUSP AND V HUSBAND ARE FRNDS SUSP YELLED STAY AWAY FROM HIM AND PUSHED V,0
16,BATTERY - SIMPLE ASSAULT,DO-SUSPS WERE VERBALLY ABUSING VICT DURING WHICH TIME S1 STRUCK VICT THREETIMES ON THE BACK OF HIS LEFT SHOULDER,0


## Prepare our dataset

In [11]:
# Drop missing data
print('Started with', df.shape)
# We can either drop the rows with missing data in specific columns or just use
# df = df.dropna()
df = df.dropna(subset=['DO_NARRATIVE', 'is_aggravated'])
print('Ended with', df.shape)

Started with (165965, 3)
Ended with (165965, 3)


In [12]:
# Vectorize

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.DO_NARRATIVE)

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())

In [14]:
words_df.head(5)

Unnamed: 0,10,12,13,15,1yr,20,2x,2yrs,3x,3yrs,abdomen,able,about,above,abrasion,abuse,accused,across,adn,adv,advised,after,again,against,aggressive,air,alcohol,all,alley,almost,along,also,alt,altercation,am,an,and,anger,angered,angry,ankle,another,any,apartment,app,apparent,appr,apprchd,apprd,appro,...,walked,walking,wall,want,wanted,was,water,way,wb,weapon,went,were,west,westbound,western,what,whats,when,where,which,while,who,why,wife,will,window,wit,wit1,with,without,witness,wits,wood,wooden,words,work,would,wrist,wrists,wth,yard,year,years,yell,yelled,yelling,you,your,yr,yrs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110479,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.393616,0.0,0.0,0.0
1,0.0,0.0,0.0,0.440575,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.178163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.117797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.275916,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.184454,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.252504,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.204037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.322264,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# pip install pystemmer


## Train/test split

In [16]:
# Set X (features) and y (labels)

from sklearn.model_selection import train_test_split

X = words_df
y = df.is_aggravated

X_train, X_test, y_train, y_test = train_test_split(X, y)
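By default `train_test_split` shuffles randomly, so the scores change slightly on every run. Going by the counts above, roughly 28 percent of these assaults are aggravated, so it may also be worth keeping that ratio identical in both halves. The sketch below shows the `stratify` and `random_state` arguments on made-up arrays. The toy sizes are assumptions chosen to make the arithmetic easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 15 labelled 0 and 5 labelled 1 (a 75/25 split)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.2,      # hold out 4 of the 20 rows
    stratify=y_toy,     # keep the 75/25 class ratio in both halves
    random_state=42,    # fixed seed so the split is repeatable
)

print("train:", np.bincount(y_tr))
print("test: ", np.bincount(y_te))
```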

## Create a classifier

Now we'll use this dataset to build a classifier that can detect whether an assault is aggravated assault or simple assault. What kind should we use?

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

In [18]:
%%time
# clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
# clf = RandomForestClassifier(n_estimators=50)
# clf = MultinomialNB()
clf = LinearSVC()
clf.fit(X_train, y_train)

CPU times: user 3.1 s, sys: 19.3 ms, total: 3.12 s
Wall time: 3.16 s


LinearSVC()
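One way to approach the "what kind should we use?" question is to score several candidates with cross-validation instead of a single split. The following is a rough sketch on synthetic data from `make_classification`, not on the crime narratives, so the numbers it prints say nothing about this dataset. `MultinomialNB` is left out because the synthetic features can be negative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and 0/1 labels
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

# Mean accuracy over 5 folds for each candidate
results = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

On the real data the same loop would take `X_train` and `y_train` in place of the synthetic arrays, at the cost of fitting each model five times.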

## Evaluating our classifier

In [19]:
clf.score(X_test, y_test)

0.892605803528391

In [20]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not aggravated', 'aggravated'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1),axis=0)

Unnamed: 0,Predicted not aggravated,Predicted aggravated
Is not aggravated,0.95991,0.04009
Is aggravated,0.275571,0.724429
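Each row of this table is divided by its own total, so the bottom-right cell is the share of truly aggravated assaults the classifier caught (the recall for the aggravated class). As a small sketch on made-up labels, `confusion_matrix` can do the same row normalization with `normalize="true"`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Invented labels: 4 simple assaults (0) and 4 aggravated assaults (1)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)

# Manual row normalization, as in the cell above
manual = cm / cm.sum(axis=1, keepdims=True)

# Built-in equivalent: normalize over the true labels (rows)
built_in = confusion_matrix(y_true, y_pred, normalize="true")

print(manual)
print(recall_score(y_true, y_pred))  # same value as the bottom-right cell
```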


## Explaining our classifier

In [21]:
import eli5

In [22]:
eli5.show_weights(clf, top=(10,10), feature_names=vectorizer.get_feature_names_out())

Weight?,Feature
+3.883,stabbed
+3.869,shot
+3.850,fired
+3.015,shoots
+3.011,knife
+2.869,pointed
+2.868,handgun
+2.848,bat
+2.760,shooting
+2.612,machete


## Stemming/lemmatization

In [23]:
import Stemmer

In [24]:
# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

analyzer = TfidfVectorizer().build_analyzer()

In [25]:
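# Override TfidfVectorizer so every token is stemmed before it is counted
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Start from the normal analyzer (lowercasing + tokenizing)...
        analyzer = super().build_analyzer()
        # ...then run each list of tokens through the stemmer
        return lambda doc: stemmer.stemWords(analyzer(doc))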

# Create a new StemmedTfidfVectorizer
vectorizer = StemmedTfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.DO_NARRATIVE)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()

Unnamed: 0,10,11,12,13,15,18th,1x,1yr,20,2x,2yr,390,3x,3yr,911,abdomen,abl,about,abov,abras,abus,abv,acceler,accus,across,adn,adv,advis,after,again,against,aggress,ago,aid,air,alcohol,all,alleg,alley,allow,almost,along,also,alt,alterc,am,an,and,anger,angri,...,want,warn,was,watch,water,way,wb,weapon,week,went,were,west,westbound,western,what,when,where,whi,which,while,white,who,wife,will,window,windshield,wit,wit1,with,witha,without,woke,wood,wooden,word,work,would,wound,wrap,wrench,wrestl,wrist,wth,yard,year,yell,you,your,yr,yrs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.103665,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.369356,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.439399,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.128562,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.113155,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.177124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.219612,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.436526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.211785,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.334494,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
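The effect of overriding `build_analyzer` is easier to see on a few short documents. The sketch below swaps PyStemmer for a hypothetical `toy_stem` function that only strips a few suffixes, so it runs without extra installs. It is a stand-in for illustration, not a real stemming algorithm.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class ToyStemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Reuse the default lowercasing/tokenizing, then stem every token
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(token) for token in analyzer(doc)]


# Invented narratives
docs = ["SUSP SHOOTS AT VICT", "SUSP SHOOTING AT VICT", "SUSP PUSHED VICT"]

plain = TfidfVectorizer().fit(docs)
stemmed = ToyStemmedTfidfVectorizer().fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(stemmed.vocabulary_))
```

`shoots` and `shooting` collapse into a single `shoot` column, which is the same kind of merging visible in the stemmed column names above (`abus`, `alterc`, `angri`).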


In [26]:
X = words_df
y = df.is_aggravated

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [27]:
clf = LinearSVC()
clf.fit(X_train, y_train)

LinearSVC()

In [28]:
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not aggravated', 'aggravated'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1),axis=0)

Unnamed: 0,Predicted not aggravated,Predicted aggravated
Is not aggravated,0.960764,0.039236
Is aggravated,0.26419,0.73581


In [29]:
eli5.show_weights(clf, top=(10,10), feature_names=vectorizer.get_feature_names_out())

Weight?,Feature
+4.382,shot
+3.884,stab
+3.705,fire
+3.268,handgun
+2.965,bat
+2.874,knife
+2.701,shoot
+2.509,machet
+2.341,ram
+2.296,brick


## Which assaults get misclassified?

In [30]:
clf.predict(words_df)

array([1, 0, 0, ..., 0, 1, 1])

In [31]:
df['prediction'] = clf.predict(words_df)

In [32]:
df.head()

Unnamed: 0,CCDESC,DO_NARRATIVE,is_aggravated,prediction
2,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-S APPRCHED V AND STATED ARE YOU GOING TO FCK ME V REPLIED NO SUSP PULL ED OUT A KNIFE AND STATED IM HERE TO HURT YOU BTCH S USED PROFANITIES,1,1
4,BATTERY - SIMPLE ASSAULT,DO-SUSP USED RIGHT FIST TO PUNCH VICT IN THE HEAD ONCE N PULL VICT HAIR FOR APPRX 15 SECONDS,0,0
9,BATTERY - SIMPLE ASSAULT,DO-S APPROACHED V IN VEH S SLAPPED AND LUNGGED AT V,0,0
11,BATTERY - SIMPLE ASSAULT,DO-V STATED THAT SUSP CONFRT HER WHEN SHE TRIED TO APPR HER HUSBAND SUSP AND V HUSBAND ARE FRNDS SUSP YELLED STAY AWAY FROM HIM AND PUSHED V,0,0
16,BATTERY - SIMPLE ASSAULT,DO-SUSPS WERE VERBALLY ABUSING VICT DURING WHICH TIME S1 STRUCK VICT THREETIMES ON THE BACK OF HIS LEFT SHOULDER,0,0


In [43]:
# Flagged cases to review: predicted aggravated, but recorded as simple assault
df[(df['prediction'] == 1) & (df['is_aggravated'] == 0)].head()

Unnamed: 0,CCDESC,DO_NARRATIVE,is_aggravated,prediction,predict_proba
88,BATTERY - SIMPLE ASSAULT,DO-S THREW ROCKS AT CAR AND VICTIM STRIKING V IN SHOULDER,0,1,0.303077
205,BATTERY - SIMPLE ASSAULT,DO-VICT AND SUSP INV IN DISPUTE OVER CAR REPOSSESSION VICT CLAIMS SUSP INTENTIONALLY REVERSED INTO VICTIMS LEG,0,1,0.467999
293,BATTERY - SIMPLE ASSAULT,DO--THE S POINTED A CAN OF HAIR SPRAY AND A LIGHTER AND THREATENED TO SET THE V ON FIRE V LEFT THE LOC IN FEAR FOR HER LIFE S BATTERED V TWO HITTING HER,0,1,0.621244
298,INTIMATE PARTNER - SIMPLE ASSAULT,DO-S AND V HAVE BEEN MARRIED FOR TEN TEARS S AND V BEGAN TO ARGUE S GRABBED A HAMMER AND SWUNG AT V ALMOST STRIKING V,0,1,0.425877
442,INTIMATE PARTNER - SIMPLE ASSAULT,DO-S APP V DRAGGED V ON THE GROUND AND BRANDISHED A KNIFE S IS EX BOYFRIEND,0,1,0.573614


## Sorting by predictions

In [39]:
# higher number -> more likely aggravated
# lower number -> more likely not aggravated
# negative numbers = simple assault

# This is NOT a probability! decision_function returns a signed score
# relative to the decision boundary, but we're going to name the column
# predict_proba anyway.

df['predict_proba'] = clf.decision_function(words_df)
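The link between this score and the `prediction` column can be checked directly. For a two-class linear model, `predict` returns 1 exactly when `decision_function` is above zero. Here is a small sketch on synthetic data (an assumption, since the real features are not needed to show this).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and 0/1 labels
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

clf = LinearSVC().fit(X_toy, y_toy)

scores = clf.decision_function(X_toy)  # signed scores, not probabilities
preds = clf.predict(X_toy)

# Positive score -> class 1, zero or negative score -> class 0
same = np.array_equal(preds, (scores > 0).astype(int))
print(same)

# Scores are unbounded, unlike probabilities
print(scores.min(), scores.max())
```

If actual probabilities are needed, scikit-learn's `CalibratedClassifierCV` can wrap a `LinearSVC`, though that goes beyond what this notebook does.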

In [44]:
df[(df['prediction'] == 1) & (df['is_aggravated'] == 0)].sort_values(by='predict_proba',ascending=False)

Unnamed: 0,CCDESC,DO_NARRATIVE,is_aggravated,prediction,predict_proba
509809,BATTERY - SIMPLE ASSAULT,DO-S ATT TO STAB V WITH A KNIFE,0,1,3.297368
549646,INTIMATE PARTNER - SIMPLE ASSAULT,DO-SUSP STABBED VICT WITH GLASS BOTTLE,0,1,3.098069
132936,INTIMATE PARTNER - SIMPLE ASSAULT,DO-S STABBED V WITH A FORK,0,1,3.022746
611667,INTIMATE PARTNER - SIMPLE ASSAULT,DO-S STABBED V WITH FORK,0,1,3.022746
228905,INTIMATE PARTNER - SIMPLE ASSAULT,DO- V AND S WERE INVOLV IN VERBL ARGUMENT S THEN HIT V ON THE HEAD WITH A PAN AND ATT TO STAB V WITH KNIFE S CUT V WITH KNIFE,0,1,2.880787
57764,INTIMATE PARTNER - SIMPLE ASSAULT,DO-SUSP STABBED V,0,1,2.806269
747246,INTIMATE PARTNER - SIMPLE ASSAULT,DO-S ENTRD V REZ WOUT PERM S BKAM ANGRY AT V AND ATT TO STAB V W KNIFE,0,1,2.721365
118342,INTIMATE PARTNER - SIMPLE ASSAULT,DO- S AND V IN VERBQAL ARGUMENT S CHARGED V WITH KNIFE S CUTS STABS VICT,0,1,2.622226
695489,INTIMATE PARTNER - SIMPLE ASSAULT,DO-SUSP STABBED VICT,0,1,2.587077
131514,INTIMATE PARTNER - SIMPLE ASSAULT,DO-SUSP STABBED VICT,0,1,2.587077
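The filter-then-sort pattern used here can be sketched on a tiny made-up table. The `score` column below plays the role of `predict_proba`, and all values are invented. `pd.crosstab` is included as one way to count how many rows fall into each recorded/predicted combination.

```python
import pandas as pd

# Invented rows: recorded label, model prediction, and decision score
toy = pd.DataFrame({
    "is_aggravated": [0, 0, 1, 0, 1],
    "prediction":    [1, 1, 1, 0, 0],
    "score":         [0.3, 2.9, 1.5, -1.2, -0.4],
})

# Recorded as simple assault, but the classifier says aggravated
flagged = toy[(toy["prediction"] == 1) & (toy["is_aggravated"] == 0)]

# Highest-scoring flagged row first
top = flagged.nlargest(1, "score")

# Counts for every recorded/predicted combination
counts = pd.crosstab(toy["is_aggravated"], toy["prediction"])

print(top)
print(counts)
```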
