![Public Archive of Police Violence in Cleveland](http://archivingpoliceviolence.org/themes/papvc/images/papvc.png)

This is an example of using topic modeling to look at the [Public Archive of Police Violence in Cleveland](http://archivingpoliceviolence.org/) transcripts.

First let's convert the transcripts from docx files to text files so we can easily read them in. This assumes that [pandoc](https://github.com/jgm/pandoc/releases/tag/1.19.2.1) is installed and that the transcripts are located in a directory called transcripts.

In [1]:
import os
import glob
import subprocess

for docx_file in glob.glob("transcripts/*.docx"):
    txt_file = docx_file.replace(".docx", ".txt")
    if not os.path.isfile(txt_file):
        print("converting " + docx_file)
        subprocess.call(["pandoc", docx_file, "-o", txt_file])
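
The loop above ignores pandoc's exit status, so a failed conversion can go unnoticed. Here is a minimal sketch of a stricter variant. The helper names `txt_path_for` and `convert` and the file name `interview.docx` are made up for illustration; only `shutil.which` and `subprocess.run` come from the standard library.

```python
import os
import shutil
import subprocess


def txt_path_for(docx_path):
    """Return the .txt path next to a .docx file (hypothetical helper)."""
    root, _ext = os.path.splitext(docx_path)
    return root + ".txt"


def convert(docx_path):
    """Convert one transcript with pandoc, skipping files already converted."""
    txt_path = txt_path_for(docx_path)
    if os.path.isfile(txt_path):
        return txt_path
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error
    subprocess.run(["pandoc", docx_path, "-o", txt_path], check=True)
    return txt_path


# Only the path helper is exercised here, so this runs without pandoc.
print(txt_path_for("transcripts/interview.docx"))
```

Using `os.path.splitext` instead of `str.replace` only touches the final extension, which matters if ".docx" happens to appear elsewhere in a path.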

We're going to use a list of stopwords that we know we're not interested in.

In [2]:
from stopwords import stopwords

print("There are %s stopwords" % len(stopwords))

There are 319 stopwords


You can add your own stopwords too. Just make sure they are lowercase, because words are lowercased before being compared against the list:

In [3]:
stopwords.add("know")
stopwords.add("like")
stopwords.add("said")
stopwords.add("police")
stopwords.add("people")
stopwords.add("just")
stopwords.add("think")
stopwords.add("gonna")
stopwords.add("want")
stopwords.add("didn")
stopwords.add("going")

Next let's create a function that returns each text transcript as a list of words. It drops words of three letters or fewer, stopwords, and any word that starts with a capital letter (which filters out most names, along with sentence-initial words).

In [4]:
import re

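def docs():
    """Yield each transcript as a list of filtered words."""
    for txt_file in glob.glob("transcripts/*.txt"):
        # Close each file as soon as it has been read
        with open(txt_file) as f:
            text = f.read()
        words = []
        for word in re.split(r'\W+', text):
            # Keep words longer than 3 letters that are not stopwords
            # and do not start with a capital letter
            if len(word) > 3 and word.lower() not in stopwords and not re.match('^[A-Z]', word):
                words.append(word)
        yield words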

You can see it works by getting the first document and printing it out. Remember the stopwords have been removed.

In [5]:
print(next(docs()))

['fiancé', 'goes', 'trouble', 'young', 'child', 'mother', 'daughter', 'sending', 'child', 'play', 'recreation', 'center', 'believe', 'fact', 'security', 'guard', 'understand', 'happened', 'place', 'secured', 'personal', 'experiences', 'couple', 'fiancé', 'come', 'come', 'talk', 'rights', 'sent', 'jail', 'panties', 'house', 'phone', 'hear', 'situation', 'couldn', 'trust', 'clothes', 'coming', 'person', 'habit', 'crime', 'wouldn', 'clothes', 'arrested', 'scared', 'younger', 'nephews', 'future', 'chose', 'children', 'glad', 'worried', 'losing', 'life', 'somebody', 'supposed', 'serve', 'protect']
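
To see the filter's rules in isolation, here is a small sketch that applies the same three tests to a single sentence. The function name `filter_words`, the tiny `sample_stopwords` set, and the sentence itself are all made up for illustration; they are not part of the notebook.

```python
import re

# Hypothetical stopword set for illustration; the notebook imports its own.
sample_stopwords = {"know", "like", "said"}


def filter_words(text, stopwords):
    """Keep words longer than 3 letters that are not stopwords or capitalized."""
    kept = []
    for word in re.split(r"\W+", text):
        if len(word) > 3 and word.lower() not in stopwords and not re.match("[A-Z]", word):
            kept.append(word)
    return kept


print(filter_words("You know, Cleveland officers said the park was safe", sample_stopwords))
# prints ['officers', 'park', 'safe']
```

"You", "the" and "was" are too short, "know" and "said" are stopwords, and "Cleveland" is dropped because it starts with a capital letter.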


Ok, now we're ready to start using gensim to do topic modeling. Let's start by creating our word dictionary:

In [6]:
from gensim import corpora

dictionary = corpora.Dictionary(docs())

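Now let's create our corpus from the dictionary. Each document is converted to a bag of words, and the result is serialized to disk in Matrix Market format and loaded back: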

In [7]:
from gensim import models

def ids():
    for doc in docs():
        yield dictionary.doc2bow(doc)

path = "corpus.mm"
corpora.MmCorpus.serialize(path, ids())
corpus = corpora.MmCorpus(path)
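
Each document in the corpus is a bag of words: a list of `(word_id, count)` pairs. The following pure-Python sketch shows the idea on two made-up documents. It is an illustration only, not gensim's implementation; `build_vocab` and `to_bow` are hypothetical names, and the way ids are assigned here is an assumption of the sketch.

```python
from collections import Counter


def build_vocab(documents):
    """Map each distinct word to an integer id (id order is arbitrary here)."""
    vocab = {}
    for doc in documents:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (word_id, count) pairs for the words found in vocab."""
    counts = Counter(vocab[w] for w in doc if w in vocab)
    return sorted(counts.items())


sample_docs = [["child", "play", "child"], ["play", "center"]]  # made-up input
vocab = build_vocab(sample_docs)
print([to_bow(doc, vocab) for doc in sample_docs])
# prints [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded and only counts survive, which is all the topic model needs.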

In [8]:
lda = models.ldamodel.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=15,
    passes=20,
    iterations=50
)

In [9]:
topics = lda.top_topics(corpus, num_words=10)
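
`top_topics` returns one entry per topic, and each entry pairs a list of `(probability, word)` tuples with a coherence score for that topic. The sketch below walks that structure using a made-up stand-in, so the words and numbers are placeholders rather than results from the transcripts, and `describe` is a hypothetical helper.

```python
# Made-up stand-in for the result of lda.top_topics(); real values will differ.
sample_topics = [
    ([(0.03, "alpha"), (0.02, "beta"), (0.01, "gamma")], -1.2),
    ([(0.04, "delta"), (0.02, "epsilon"), (0.01, "zeta")], -1.9),
]


def describe(topics):
    """Format each topic as 'n. words (coherence score)'."""
    lines = []
    for num, (word_probs, coherence) in enumerate(topics, start=1):
        words = ", ".join(word for _prob, word in word_probs)
        lines.append("%d. %s (coherence %.2f)" % (num, words, coherence))
    return lines


for line in describe(sample_topics):
    print(line)
```

Printing the coherence score next to the words can help when deciding which topics deserve a closer look.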

In [10]:
# Each topic is a (word list, coherence) pair; each word entry is (probability, word)
for num, topic in enumerate(topics, start=1):
    print("%s. %s" % (num, ', '.join(t[1] for t in topic[0])))

1. weapon, guns, city, person, violence, shooting, thing, shoot, right, father
2. belong, officer, line, mean, wrong, guess, kind, better, homeless, issues
3. black, right, time, white, things, thing, happened, mean, saying, years
4. things, time, black, thing, right, good, come, house, getting, years
5. news, indistinct, http, items, archivingpoliceviolence, really, everybody, mean, doing, stuff
6. mean, good, policeman, home, time, officer, happened, cause, things, course
7. came, mean, time, right, left, happened, come, things, thing, stuff
8. little, house, come, bitch, shit, calling, right, make, good, shot
9. really, right, trying, stuff, doing, saying, thing, happened, shot, violence
10. time, came, happened, stuff, went, mean, actually, come, left, told
11. things, overlooked, mean, better, city, feel, reason, inner, protocol, seen
12. items, archivingpoliceviolence, http, tell, feel, officers, thing, report, violence, right
13. doing, tell, good, thing, stop, protect, life, re