# Checking the LAPD's crime classifications

The Times analyzed LAPD violent crime data from 2005 to 2012 to check how the department categorized crimes. The analysis found that approximately 14,000 serious assaults had been misclassified as minor offenses, which artificially lowered the city's reported crime rates. To find them, The Times used an algorithm that combined two machine learning classifiers. Each classifier read the brief crime descriptions and labeled each assault as minor or serious, which revealed the extent of the misclassification in the LAPD's crime data.

This project was sourced from a GitHub repository to demonstrate the application of machine learning in different areas.

In [1]:
import csv
import nltk
from nltk.util import ngrams
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from nltk.classify import MaxentClassifier
from nltk.stem.snowball import SnowballStemmer
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer

## Stemming and stop words

We're going to clean up the crime descriptions in two steps. First, we're going to [stem](https://en.wikipedia.org/wiki/Stemming) the words -- this reduces the words to their root in order to limit differences based on tense or whether they appear in the plural or possessive form. Then, we're going to strip out a custom list of [stop words](https://en.wikipedia.org/wiki/Stop_words).

In [2]:
# Define a standard snowball stemmer
STEMMER = SnowballStemmer('english')
# Make a list of stopwords, including the stemmed versions
# These are words that have no impact on the classification, and
# can even occasionally mess up the classifier.
STOPWORDS = [
    'susp',
    'susps',
    's',
    'v',
    'in',
    'ppa',
    'vict',
    'the',
    'and',
    '&',
    '-s',
    'after',
    'for',
    'heard',
    'second',
    'avoid',
    'hold',
    'holding',
    'retrieved',
    'battery',
    'fist',
    'of',
    'to',
    'a',
]
STOPWORDS += [STEMMER.stem(i) for i in STOPWORDS]
STOPWORDS = list(set(STOPWORDS))

## Tokenize

This is a function to take a description and break it up into the individual "features" we're going to use to classify it. We separate the description into individual words, then stem them and remove stop words. From there, we make a list of individual words and then combine them into [bigrams](https://en.wikipedia.org/wiki/Bigram).

In [3]:
def tokenize(description):
    """
    Takes LAPD description text, strips out unwanted words and text,
    and prepares it for the trainer.
    """
    # first lower case and strip leading/trailing whitespace
    description = description.lower().strip()
    # kill the 'do-'s and any stray punctuation
    description = description.replace('do-', '').replace('.', '').replace(',', '')
    # make a list of words by splitting on whitespace
    words = description.split(' ')
    # Make sure each "word" is a real string / account for odd whitespace
    words = [STEMMER.stem(i) for i in words if i]
    words = [i for i in words if i not in STOPWORDS]
    # let's see if adding bigrams improves the accuracy
    bigrams = ngrams(words, 2)
    bigrams = ["%s|%s" % (i[0], i[1]) for i in bigrams]
    # bigrams = [i for i in bigrams if STEMMED_BIGRAMS.get(i)]
    # set up a dict
    out_dict = dict([(i, True) for i in words + bigrams])
    # The NLTK trainer expects data in a certain format
    return out_dict
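
To make the bigram step concrete, here is a minimal sketch that builds the same kind of feature dict from words that are already stemmed and filtered. It uses plain `zip` instead of `nltk.util.ngrams`, which yields the same adjacent pairs for n=2. The helper name `make_features` and the sample words are hypothetical, chosen only for illustration.

```python
def make_features(words):
    """Build a unigram + bigram feature dict from already-cleaned words."""
    # Pair each word with its successor, joined with '|' as in tokenize()
    bigrams = ["%s|%s" % pair for pair in zip(words, words[1:])]
    # Every feature maps to True, the format the NLTK trainers expect
    return {feature: True for feature in words + bigrams}


# Hypothetical input: words that have already been stemmed and filtered
example = make_features(['kick', 'polic', 'sergeant'])
print(sorted(example))
```

Three words produce three unigram features and two bigram features, so word order contributes to the classification as well as word presence.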


## Grab the features

Loop through our example CSV and grab the features we're going to use to train our classifiers.

In [4]:
# open our sample file (plain 'r' mode works on Python 2 and 3)
# and use the CSV module to parse it
with open('training_data.csv', 'r') as f:
    data = list(csv.DictReader(f))
# Make an empty list for our processed data
features = []
# Loop through all the lines in the CSV
for i in data:
    description = i.get('NARRATIVE')
    classification = i.get('classification')
    feats = tokenize(description)
    features.append((feats, classification))

In [5]:
# Here's what this looks like
print(features[0])

({u'kick|polic': True, u'use': True, u'his': True, u'leg': True, u'polic': True, u'under|arrest': True, u'right|leg': True, u'place|under': True, u'back': True, u'sergeant|back': True, u'arrest': True, u'right': True, u'place': True, u'sergeant': True, u'use|his': True, u'under': True, u'his|right': True, u'arrest|use': True, u'leg|kick': True, u'kick': True, u'polic|sergeant': True}, 'minor')
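
Before training, it can help to check how many examples carry each label, since a lopsided sample can bias a classifier. The sketch below counts labels in a list shaped like `features`. The three sample rows are hypothetical stand-ins, not rows from the training file.

```python
from collections import Counter

# Hypothetical stand-in for the `features` list built above:
# each entry is a (feature_dict, label) pair
sample_features = [
    ({'kick': True}, 'minor'),
    ({'knife': True}, 'serious'),
    ({'push': True}, 'minor'),
]

# Count how many examples carry each label
label_counts = Counter(label for _, label in sample_features)
print(label_counts.most_common())
```

Running the same count on the real `features` list would show whether the training sample is balanced between minor and serious assaults.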


## Train the classifiers

For this analysis we used two machine learning classifiers. The first is a linear [support vector machine](http://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html) from the stellar [scikit-learn Python library](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). The second is a [maximum entropy classifier](http://www.nltk.org/book/ch06.html#maximum-entropy-classifiers). For the official analysis I used the [MegaM](http://www.umiacs.umd.edu/~hal/megam/) optimization package to dramatically improve the training speed. Here, for simplicity, I'm using the trainer built into NLTK.

In [6]:
# Train our classifiers. Let's start with Linear SVC
# Make a data prep pipeline
pipeline = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('linearsvc', LinearSVC()),
])
# make the classifier
linear_svc = SklearnClassifier(pipeline)
# Train it
linear_svc.train(features)

<SklearnClassifier(Pipeline(steps=[('tfidf', TfidfTransformer(norm=u'l2', smooth_idf=True, sublinear_tf=False,
         use_idf=True)), ('linearsvc', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]))>
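
The repr above shows the `TfidfTransformer` settings in effect: `use_idf=True`, `smooth_idf=True`, `norm='l2'` and `sublinear_tf=False`. As a rough illustration of what that weighting does to boolean features, here is a pure-Python sketch of the smoothed formula that scikit-learn documents, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The toy `docs` and the helper names `smoothed_idf` and `tfidf_vector` are hypothetical, and the sketch assumes every present feature has a term frequency of 1. It is not the library's implementation.

```python
import math

# Hypothetical toy feature dicts in the same shape tokenize() returns
docs = [
    {'kick': True, 'leg': True},
    {'kick': True, 'knife': True},
    {'kick': True, 'punch': True, 'leg': True},
]


def smoothed_idf(docs):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smooth_idf=True formula."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return {t: math.log((1.0 + n) / (1.0 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(doc, idf):
    """Weight each present feature (tf assumed to be 1) by idf, then L2-normalize."""
    weights = {t: idf[t] for t in doc}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


idf = smoothed_idf(docs)
vec = tfidf_vector(docs[1], idf)
print(sorted(vec.items()))
```

A feature that appears in every description, such as `kick` here, gets the minimum idf of 1.0, while a rarer feature such as `knife` gets a larger weight. The linear SVM then sees rare, distinctive terms more strongly than common ones.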

In [7]:
# Next, let's do the Maximum Entropy
maxent = MaxentClassifier.train(features)

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.500
             2          -0.43483        0.970
             3          -0.32266        0.990
             4          -0.25840        0.990
             5          -0.21623        1.000
             6          -0.18624        1.000
             7          -0.16375        1.000
             8          -0.14621        1.000
             9          -0.13215        1.000
            10          -0.12061        1.000
            11          -0.11097        1.000
            12          -0.10279        1.000
            13          -0.09577        1.000
            14          -0.08967        1.000
            15          -0.08432        1.000
            16          -0.07959        1.000
            17          -0.07538        1.000
            18          -0.07161        1.000
            19          -0.06821        1.000
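
The first row of the training log can be checked by hand. Assuming the model starts out with no learned weights and so assigns equal probability to each of the two labels, the log likelihood is ln(0.5), which matches the -0.69315 reported for iteration 1. The variable names below are illustrative.

```python
import math

# Two possible labels: 'minor' and 'serious'
n_classes = 2
# Assumption: an untrained model gives each label the same probability
uniform_prob = 1.0 / n_classes
initial_log_likelihood = math.log(uniform_prob)
print(round(initial_log_likelihood, 5))
```

As training proceeds, the log likelihood climbs toward 0, meaning the model assigns ever higher probability to the correct labels in the training sample. The 1.000 accuracy in the log is measured on that same training sample, so it says nothing about performance on unseen crimes.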
 

## Testing the classifiers

Now let's test these out! For this example we're only using a training sample of 100 crimes, which is not going to produce very accurate results. For our official analysis, we used a training sample of more than 20,000 crimes we reviewed as part of a previous story in 2014. We also chose to use two classifiers because, though they agreed on the vast majority of crimes, each classifier did a better job with some edge cases we didn't want to miss. You can check out the results below.

In [8]:
# Now, let's try these out
with open('test_data.csv', 'r') as f:
    test_data = list(csv.DictReader(f))
for i in test_data:
    description = i.get('NARRATIVE')
    classification = i.get('classification')
    toks = tokenize(description)
    # now grab the results of our classifiers
    maxent_class = maxent.classify(toks)
    svc_class = linear_svc.classify(toks)
    print('correct: %s | maxent: %s | linear svc: %s' % (classification, maxent_class, svc_class))

correct: minor | maxent: serious | linear svc: serious
correct: minor | maxent: minor | linear svc: minor
correct: minor | maxent: serious | linear svc: serious
correct: serious | maxent: serious | linear svc: serious
correct: serious | maxent: serious | linear svc: serious
correct: minor | maxent: minor | linear svc: minor
correct: minor | maxent: minor | linear svc: minor
correct: minor | maxent: serious | linear svc: serious
correct: serious | maxent: serious | linear svc: serious
correct: minor | maxent: minor | linear svc: minor
correct: minor | maxent: minor | linear svc: minor
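
The rows above can be tallied rather than eyeballed. This sketch hard-codes the eleven printed rows and counts how often each classifier matched the correct label and how often the two classifiers agreed with each other. The variable names are illustrative. In the notebook itself, the same counters could be incremented inside the test loop.

```python
# (correct, maxent, linear svc) rows copied from the output above
results = [
    ('minor', 'serious', 'serious'),
    ('minor', 'minor', 'minor'),
    ('minor', 'serious', 'serious'),
    ('serious', 'serious', 'serious'),
    ('serious', 'serious', 'serious'),
    ('minor', 'minor', 'minor'),
    ('minor', 'minor', 'minor'),
    ('minor', 'serious', 'serious'),
    ('serious', 'serious', 'serious'),
    ('minor', 'minor', 'minor'),
    ('minor', 'minor', 'minor'),
]

total = len(results)
# Count rows where each classifier matched the correct label
maxent_hits = sum(1 for truth, me, _ in results if truth == me)
svc_hits = sum(1 for truth, _, svc in results if truth == svc)
# Count rows where the two classifiers gave the same label
agreements = sum(1 for _, me, svc in results if me == svc)

print('maxent: %d/%d | linear svc: %d/%d | agree: %d/%d'
      % (maxent_hits, total, svc_hits, total, agreements, total))
```

On this small sample both classifiers get 8 of 11 right and agree on every row. All three misses are minor assaults that both classifiers labeled serious, which is consistent with the caveat that 100 training examples are too few for accurate results.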
