## Applying the Classifier

Let's apply the classifier to the data we tried to manually code.

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)

%matplotlib inline



## Reading the data

In [2]:
X_train = pd.read_csv('assaults_downgraded_train.csv', index_col=0)
X_test_with_answers = pd.read_csv('assaults_downgraded_test_with_answers.csv', index_col=0)
X_test = pd.read_csv('assaults_downgraded_test.csv', index_col=0).drop(columns='downgraded').rename(columns={'serious': 'serious_you'})
X_test

Unnamed: 0,CCDESC,DO_NARRATIVE,serious_you
483580,,DO- S AND V BECAME INVOLV IN AN ARGUMENT S BECAME UPSET AND STRUCK V IN THE FACE WITH A CLOSED FIST FIVE TIMES,0
745059,,DO-VICT AND SUSP INVOLVED IN A VERBAL ARGUMENT SUSP SPIT ONCE IN THE VICTS FACE SUSP FLED ON BICYCLE,0
644873,,DO-SUSP AND VIC WERE INVLD IN A VERBAL ARGUMENT SUSP STRUCK VIC IN HAND WITH UNK OBJECT CAUSING HALF INCH LACERATION TO HIS LEFT THUMB,0
394517,,DO-WHILE VICT WALKING TO SCHOOL SHE WAS APPROACH BY SUSP WHO ALSO IS A STUDENT TOLD VICT COME HERE SUSP PUNCH VICT IN FACE WITH A FIST AND SLAP VICT,0
604009,,DO-SUSP GRABBED VICT BY THE SHIRT AND PUSHED VICT LEAVING VISIBLE INJURY,0
223707,,DO-S ATT TO PUSH V OFF OF HER BIKE,0
295037,,DO-SUSP PUSHED VICT DURING CHILD CUSTODY CHANGE,0
216580,,DO-SUSP STABBED VIC WITH UNK WEAPON MULTIPLE TIMES SUSP FLED IN UNK DIR,1
807867,,DO-VICT AND SUSP GOT INTO VERBAL ARGUMENT SUSP BECAME HEATED AND STRUCK VICT ON THE FACE STOMACH AND ARM,0
685433,,DO-V WAS STRUCK WITH A CLOSED FIST BY HER HUSBAND,0


## Vectorize & Classify

Vectorize and classify in one big cell!

In [3]:
from nltk.stem import SnowballStemmer
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

nltk.download('omw-1.4')

# Subclass TfidfVectorizer so every token is stemmed before it is counted
stemmer = SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))

# vectorize from training set: ignore terms that appear in fewer than
# 15 reports or in more than half of all reports
vectorizer = StemmedTfidfVectorizer(min_df=15, max_df=0.5)
X = vectorizer.fit_transform(X_train.DO_NARRATIVE)

# classify
y = X_train.serious
clf = LinearSVC()
clf.fit(X, y)

# get scores - cross validate
scores = cross_validate(clf, X, y, cv=10,
                        scoring=('accuracy', 'precision', 'recall', 'f1'))

# here are some other types of scores
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
scores_df = pd.DataFrame(scores)
display(scores_df.round(2))
# average each column across the 10 folds
scores_df[
    ['fit_time', 'score_time', 'test_accuracy', 'test_precision', 'test_recall', 'test_f1']]\
    .mean().round(2)

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/areena.arora/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
0,1.57,0.02,0.87,0.78,0.67,0.72
1,1.57,0.01,0.87,0.77,0.67,0.72
2,1.54,0.01,0.87,0.77,0.67,0.72
3,1.55,0.01,0.87,0.77,0.66,0.71
4,1.55,0.02,0.87,0.78,0.67,0.72
5,1.57,0.01,0.87,0.77,0.67,0.72
6,1.57,0.02,0.87,0.77,0.67,0.72
7,1.56,0.02,0.87,0.78,0.65,0.71
8,1.56,0.02,0.87,0.76,0.66,0.71
9,1.55,0.01,0.87,0.77,0.66,0.71


fit_time          1.56
score_time        0.01
test_accuracy     0.87
test_precision    0.77
test_recall       0.66
test_f1           0.71
dtype: float64
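
As a quick sanity check on the summary above, F1 is the harmonic mean of precision and recall. The sketch below plugs in the rounded averages from the output. The notebook's `test_f1` is the average of the per-fold F1 scores, so it does not have to match this calculation exactly, but the two should land close together.

```python
# Rounded mean scores copied from the cross-validation summary above
precision = 0.77
recall = 0.66

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```

Because the harmonic mean is pulled toward the smaller of the two numbers, the lower recall drags F1 below the simple average of precision and recall.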

## Making predictions

No matter which terms point to a report being filed as Part I or Part II, at the end of the day we're interested in **how good our model is at making predictions.** To test that, we need to run predictions on content we already know the answer to. Let's start by seeing how it does on the narratives in our test set.

In [4]:
X_test_vectors = vectorizer.transform(X_test.DO_NARRATIVE)

In [5]:
predictions = clf.predict(X_test_vectors)
predictions

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1])

In this case `1` means that yes, it was a serious assault. Let's try a few more.

You can see that offenses that include weapons tend to be predicted as serious offenses, while ones involving punching or other direct physical contact are classified as simple assault.

Instead of just looking at which category a report was put in, we can also look at **the score the classifier used for the prediction.**

In [6]:
prediction_score = clf.decision_function(X_test_vectors)
prediction_score

array([-1.21226456, -1.32123086, -0.26566673, -1.02701054, -1.63450391,
       -1.04066259, -1.19610637,  1.02885728, -0.95500414, -1.25463411,
       -0.85691186, -0.83979128, -0.84316617, -1.01935543, -0.92608457,
       -0.4891634 , -1.37629669, -0.56762547, -0.99604453, -0.96656839,
       -0.71391027, -1.3476531 , -1.134269  , -0.45375044, -0.60588183,
       -1.52547708, -0.32429662, -0.64301664, -1.14964791,  0.28600608,
       -1.1692806 , -1.13641987, -0.77059868, -1.25047744, -1.30711862,
       -0.65104045, -0.9716887 , -0.60207691, -0.55921009,  0.01883554,
       -1.10562858,  0.97422517, -0.51896331, -0.60279691,  0.30007348,
       -1.97960084, -0.60923845,  1.83560777, -1.20340823, -1.70684105,
       -0.97737373, -1.14817886, -0.28447895, -1.59650136, -1.07821817,
       -1.05441132, -0.9557595 ,  0.90628235, -0.73098711, -1.26572872,
       -0.35289685, -0.77650284,  0.55921893, -0.42028857,  0.11387191,
       -1.33814502, -1.40082938, -1.15409477, -0.32436796, -1.21
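
To make the link between the score and the predicted category concrete, here is a minimal sketch on made-up toy data (the arrays and the `toy_clf` name are illustrative assumptions, not part of the assault dataset). For a two-class `LinearSVC`, `predict` returns the positive class when the `decision_function` score is above zero. The score is a signed distance-like confidence value, not a probability.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy, made-up data: two features, two well-separated classes
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                  [1.0, 0.9], [0.8, 1.0], [0.9, 0.7]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LinearSVC()
toy_clf.fit(X_toy, y_toy)

toy_scores = toy_clf.decision_function(X_toy)
toy_labels = toy_clf.predict(X_toy)

# Thresholding the score at zero reproduces the predicted labels
labels_from_scores = (toy_scores > 0).astype(int)
print(np.array_equal(toy_labels, labels_from_scores))
```

Scores close to zero are the reports the classifier was least sure about, which makes them good candidates for a closer read.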

In [7]:
X_test_with_answers

Unnamed: 0,CCDESC,DO_NARRATIVE,serious,downgraded
483580,INTIMATE PARTNER - SIMPLE ASSAULT,DO- S AND V BECAME INVOLV IN AN ARGUMENT S BECAME UPSET AND STRUCK V IN THE FACE WITH A CLOSED FIST FIVE TIMES,0,0
745059,BATTERY - SIMPLE ASSAULT,DO-VICT AND SUSP INVOLVED IN A VERBAL ARGUMENT SUSP SPIT ONCE IN THE VICTS FACE SUSP FLED ON BICYCLE,0,0
644873,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUSP AND VIC WERE INVLD IN A VERBAL ARGUMENT SUSP STRUCK VIC IN HAND WITH UNK OBJECT CAUSING HALF INCH LACERATION TO HIS LEFT THUMB,1,0
394517,BATTERY - SIMPLE ASSAULT,DO-WHILE VICT WALKING TO SCHOOL SHE WAS APPROACH BY SUSP WHO ALSO IS A STUDENT TOLD VICT COME HERE SUSP PUNCH VICT IN FACE WITH A FIST AND SLAP VICT,0,0
604009,INTIMATE PARTNER - SIMPLE ASSAULT,DO-SUSP GRABBED VICT BY THE SHIRT AND PUSHED VICT LEAVING VISIBLE INJURY,0,0
223707,BATTERY - SIMPLE ASSAULT,DO-S ATT TO PUSH V OFF OF HER BIKE,0,0
295037,BATTERY - SIMPLE ASSAULT,DO-SUSP PUSHED VICT DURING CHILD CUSTODY CHANGE,0,0
216580,"ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT",DO-SUSP STABBED VIC WITH UNK WEAPON MULTIPLE TIMES SUSP FLED IN UNK DIR,1,0
807867,INTIMATE PARTNER - SIMPLE ASSAULT,DO-VICT AND SUSP GOT INTO VERBAL ARGUMENT SUSP BECAME HEATED AND STRUCK VICT ON THE FACE STOMACH AND ARM,0,0
685433,BATTERY - SIMPLE ASSAULT,DO-V WAS STRUCK WITH A CLOSED FIST BY HER HUSBAND,0,0


In [8]:
compare_df = X_test.merge(X_test_with_answers[['serious', 'downgraded']], left_index=True, right_index=True)
compare_df['serious_clf'] = predictions
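# despite the column name, this is the raw decision_function score
# (negative = not serious, positive = serious), not a percentage
compare_df['serious_clf_pct'] = prediction_score.round(2)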
compare_df = compare_df.sort_values(by='serious_clf_pct', ascending=True)
compare_df


Unnamed: 0,CCDESC,DO_NARRATIVE,serious_you,serious,downgraded,serious_clf,serious_clf_pct
539918,,DO-SUSP CHARGED AT THE PPA PUSHED HER AND GRABBED HER SHIRT,,0,0,0,-1.98
813921,,DO-SUSP AND VICT WERE ARGUING SUSP APPROACHED VICT SUSP PUSHED VICT VICT PUSHED SUSP SUSP PUNCHED VICT IN FACE APPROX 2 TIMES,,0,0,0,-1.9
764244,,DO-S AND V BGN ARG S PSHD V AND RMV V ID FRM HER PURSE V FLED INTO RESD,,0,0,0,-1.71
604009,,DO-SUSP GRABBED VICT BY THE SHIRT AND PUSHED VICT LEAVING VISIBLE INJURY,0,0,0,0,-1.63
690932,,DO-SUSP PUSHED VICT DURING AN ARGUMENT,,0,0,0,-1.6
416947,,DO-SUSP AND VICT LIVING TOGETHER AND DATING VICT AND SUSP ARGUING OUTSIDESUSP SLAPPED VICT IN FACE AND WALKED AWAY,0,0,0,0,-1.53
244373,,DO-SUSP AND VICT ARE HUSBAND AND WIFE SUSP AND VICT BECAME INVOLVED IN A VERBAL ARGUMENT SUSP THEN PUSHED VICT AND TOOK HER KEYS,,0,0,0,-1.45
592771,,DO-V AND S HAVE A CHILD IN COMMON AND ARE FORMER COHAB S AND V WERE INVOL IN AN ARGUMENT WHEN THE S PUNCHED THE V ON THE LEFT LEG LEAVING VISIBLE BRUISE,,0,0,0,-1.4
712826,,DO-S AND V ARE PSYCH PATIENTS AT OLIVE VIEW S PUNCHED V IN FACE,0,0,0,0,-1.38
700665,,DO-S AND V ARE COHABITANTS WITH ONE CHILD IN COMMON S AND V BECAME INVOLVED IN AN ARGUMENT AND S SLAPPED V AND PUSHED HER TO THE GROUND THEN FLED THE LOC,0,0,0,0,-1.35


## Discussion

What did you get wrong? What did the machine get wrong?

- What are precision errors?
- What are recall errors?
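
One way to think about the two questions above: a precision error is a report flagged as serious that was not (a false positive), and a recall error is a serious report that was missed (a false negative). The sketch below counts both on a small set of hypothetical labels. The `y_true` and `y_pred` lists are invented for illustration and do not come from the dataset.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for eight reports: 1 = serious, 0 = not serious
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision = tp / (tp + fp): false positives hurt precision
# recall    = tp / (tp + fn): false negatives hurt recall
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(fp, fn, precision, recall)
```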

Filter `compare_df` in the cells below, or save it to a CSV and open it up to answer the discussion questions.

In [9]:
compare_df.to_csv('compare_df.csv')
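
If you would rather filter in the notebook than in a spreadsheet, the following is a minimal sketch of one approach. It uses a small hypothetical stand-in, `toy_compare`, that borrows the column names of `compare_df`. The values and index numbers are made up.

```python
import pandas as pd

# Hypothetical stand-in for compare_df, using the same column names
toy_compare = pd.DataFrame({
    'serious':     [1, 0, 1, 0, 1],
    'serious_you': [1, 0, 0, 1, 1],
    'serious_clf': [1, 0, 0, 0, 0],
}, index=[101, 102, 103, 104, 105])

# Rows where the classifier disagrees with the recorded answer
clf_wrong = toy_compare[toy_compare.serious_clf != toy_compare.serious]

# Rows where the hand coding disagrees with the recorded answer
you_wrong = toy_compare[toy_compare.serious_you != toy_compare.serious]

print(clf_wrong.index.tolist())
print(you_wrong.index.tolist())
```

In the real `compare_df`, some rows have no `serious_you` value. A missing value never equals a label, so those rows would show up as disagreements unless you drop them first, for example with `dropna(subset=['serious_you'])`.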