In this notebook we'll explore topic modeling to discover broad themes in a collection of movie summaries.  To get started, install gensim:

In [None]:
!pip install gensim==3.8.3

In [1]:
import nltk
import re
import gensim
from gensim import corpora
import operator

nltk.download('stopwords')
from nltk.corpus import stopwords

import numpy as np
import random

random.seed(1)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dbamman/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
def read_stopwords(filename):
    stopwords={}
    with open(filename) as file:
        for line in file:
            stopwords[line.rstrip()]=1
    return stopwords

Since we're running topic modeling on texts with lots of names, we'll add the Jockers list of stopwords (which includes character names) to our stoplist.

In [3]:
stop_words = {k:1 for k in stopwords.words('english')}
stop_words.update(read_stopwords("../data/jockers.stopwords"))
stop_words["'s"]=1
# keep the stoplist as a set so each membership test is a constant-time lookup
stop_words=set(stop_words)

In [4]:
def filter(word, stopwords):
    
    """ Function to exclude words from a text """
    
    # no stopwords
    if word in stopwords:
        return False
    
    # has to contain at least one letter
    if re.search("[A-Za-z]", word) is not None:
        return True
    
    return False
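
As a quick illustration of the rule above, here is a minimal, self-contained sketch. The name `keep_token`, the toy stoplist, and the toy tokens are hypothetical and only mirror the logic of `filter`; they are not part of the notebook's pipeline.

```python
import re

def keep_token(word, stopwords):
    """Mirror of the notebook's filter(): drop stopwords and tokens with no letters."""
    if word in stopwords:
        return False
    return re.search("[A-Za-z]", word) is not None

# Hypothetical toy stoplist and tokens, for illustration only
toy_stopwords = {"the", "a", "'s"}
tokens = ["the", "vampire", "'s", "castle", ",", "1897", "dr."]

kept = [t for t in tokens if keep_token(t, toy_stopwords)]
print(kept)
```

Punctuation and purely numeric tokens are dropped because they contain no letters, while an abbreviation such as `dr.` survives because it contains at least one.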

In [5]:
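def read_docs(plotFile, metadataFile, stopwords):
    
    """ Function to read the plot summaries of the n movies with the highest box office revenue """
    
    names={}
    box={}
    
    # metadata columns used here: 0 = movie id, 2 = movie name, 4 = box office revenue
    with open(metadataFile, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            idd=cols[0]
            name=cols[2]
            boxoffice=cols[4]
            # skip movies with no recorded box office revenue
            if len(boxoffice) != 0:
                box[idd]=int(boxoffice)
                names[idd]=name
    
    n=5000
    target_movies={}

    # keep the n movies with the highest revenue
    sorted_box = sorted(box.items(), key=operator.itemgetter(1), reverse=True)
    for k, v in sorted_box[:n]:
        target_movies[k]=names[k]
    
    docs=[]
    names=[]
    
    # plot file columns: 0 = movie id, 1 = summary text
    with open(plotFile, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            idd=cols[0]
            text=cols[1]
            
            if idd in target_movies:
                # lowercase, tokenize, then drop stopwords and tokens with no letters
                tokens=nltk.word_tokenize(text.lower())
                tokens=[x for x in tokens if filter(x, stopwords)]
                docs.append(tokens)
                name=target_movies[idd]
                names.append(name)
    return docs, names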

We'll read in summaries of the 5,000 movies with the highest box office revenues.

In [6]:
metadataFile="../data/movie.metadata.tsv"
plotFile="../data/plot_summaries.txt"
data, doc_names=read_docs(plotFile, metadataFile, stop_words)

We will convert the movie summaries into a bag-of-words representation using gensim's [corpora.dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) methods.

In [7]:
# Create vocab from data; restrict vocab to only the top 10K terms that show up in at least 5 documents 
# and no more than 50% of all documents

dictionary = corpora.Dictionary(data)
dictionary.filter_extremes(no_below=5, no_above=.5, keep_n=10000)

In [8]:
# Replace each document with (word id, count) pairs for the words in the vocab (all other words are dropped)
corpus = [dictionary.doc2bow(text) for text in data]
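
To make the two steps above concrete, here is a pure-Python sketch of roughly what the vocabulary filtering and the bag-of-words conversion compute. It does not call gensim: `build_vocab`, `to_bow`, and the toy documents are hypothetical, and the ids it assigns will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs, no_below, no_above, keep_n):
    """Keep words found in at least `no_below` docs and at most a fraction
    `no_above` of all docs, then keep the `keep_n` with highest document frequency."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each word once per document
    max_df = no_above * len(docs)
    eligible = [w for w in df if no_below <= df[w] <= max_df]
    eligible.sort(key=lambda w: (-df[w], w))
    return {w: i for i, w in enumerate(eligible[:keep_n])}

def to_bow(doc, vocab):
    """Convert a token list into sorted (word id, count) pairs, skipping unknown words."""
    counts = Counter(vocab[w] for w in doc if w in vocab)
    return sorted(counts.items())

# Hypothetical toy corpus, for illustration only
toy_docs = [
    ["vampire", "town", "town", "night"],
    ["school", "teacher", "town", "night"],
    ["school", "coach", "team", "night"],
    ["war", "army", "town", "night"],
]

vocab = build_vocab(toy_docs, no_below=2, no_above=0.75, keep_n=10)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

In this toy corpus, `night` is removed because it appears in every document, and words that appear in only one document are removed as too rare, which is the same trade-off the `no_below` and `no_above` thresholds make on the movie summaries.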

In [9]:
num_topics=20

Now let's run a topic model on this data using gensim's built-in LDA.

In [12]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=num_topics, 
                                           passes=10,
                                           alpha='auto')

We can get a sense of what each topic captures by printing the 10 words with the highest $P(word \mid topic)$ for each topic.

In [13]:
for i in range(num_topics):
    print("topic %s:\t%s" % (i, ' '.join([term for term, freq in lda_model.show_topic(i, topn=10)])))

topic 0:	president white turner states sullivan united campaign black washington campbell
topic 1:	killer gang smith police murder detective cat case bishop mouse
topic 2:	town find killed dracula kill vampire infected train group brothers
topic 3:	school high students student teacher class friends college kid camp
topic 4:	father family life mother time years tells wife daughter film
topic 5:	war army soldiers doc men japanese general battle colonel american
topic 6:	dr. band evil satan ghost spirit hospital body film child
topic 7:	film life relationship hotel company time women job kane end
topic 8:	show film money race job fight big win time club
topic 9:	book case murder judge tells office story evidence court trial
topic 10:	kill escape killed men kills death killing team police dead
topic 11:	team game coach play football brown player players win playing
topic 12:	bond jaguar agent flynn bolt frost nash formula knox diamonds
topic 13:	house tells night goes finds day father moth
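
Conceptually, each line above comes from sorting one topic's word distribution and keeping the head of the list. A minimal sketch of that idea, using made-up probabilities (`toy_topics` and `top_words` are hypothetical names, not gensim objects):

```python
# Hypothetical P(word | topic) values for two toy topics; each row sums to 1
toy_topics = {
    0: {"war": 0.40, "army": 0.30, "night": 0.20, "school": 0.05, "teacher": 0.05},
    1: {"school": 0.45, "teacher": 0.25, "night": 0.20, "war": 0.05, "army": 0.05},
}

def top_words(word_probs, topn):
    """Return the `topn` words with the highest probability under one topic."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, prob in ranked[:topn]]

for topic_id, word_probs in toy_topics.items():
    print("topic %s:\t%s" % (topic_id, " ".join(top_words(word_probs, topn=3))))
```

A word such as `night` can rank highly in several topics at once, which is why the top-word lists of different topics often overlap.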

Another way of understanding topics is to print out the documents that have the highest topic representation -- i.e., for a given topic $k$, the documents with the highest $P(topic=k \mid document)$.  How well do the documents listed here align with your understanding of the topics?

In [14]:
topic_model=lda_model 

topic_docs=[]
for i in range(num_topics):
    topic_docs.append({})
for doc_id in range(len(corpus)):
    doc_topics=topic_model.get_document_topics(corpus[doc_id])
    for topic_num, topic_prob in doc_topics:
        topic_docs[topic_num][doc_id]=topic_prob

for i in range(num_topics):
    print("%s\n" % ' '.join([term for term, freq in topic_model.show_topic(i, topn=10)]))
    sorted_x = sorted(topic_docs[i].items(), key=operator.itemgetter(1), reverse=True)
    for k, v in sorted_x[:5]:
        print("%s\t%.3f\t%s" % (i,v,doc_names[k]))
    print()
    
    

president white turner states sullivan united campaign black washington campbell

0	0.927	Made in Dagenham
0	0.589	Swing Vote
0	0.583	Murder in the First
0	0.534	Invictus
0	0.522	Mother

killer gang smith police murder detective cat case bishop mouse

1	0.897	Bad Girls
1	0.889	Freebie and The Bean
1	0.718	Gang Related
1	0.646	The Unjust
1	0.623	The Hard Way

town find killed dracula kill vampire infected train group brothers

2	0.763	Dracula 2000
2	0.746	Van Helsing
2	0.731	Return of the Living Dead Part II
2	0.696	Bats
2	0.621	REC 3

school high students student teacher class friends college kid camp

3	0.939	My Boss, My Teacher
3	0.927	Conduct Zero
3	0.883	Goodbye, Columbus
3	0.740	Fired Up
3	0.697	Assassination of a High School President

father family life mother time years tells wife daughter film

4	0.970	Immortal Beloved
4	0.969	On Golden Pond
4	0.969	Beginners
4	0.968	The Other Boleyn Girl
4	0.967	The Duchess

war army soldiers doc men japanese general battle colonel american
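
The ranking used for these listings can be sketched on toy data. The sketch below assumes each document comes with a sparse list of `(topic, probability)` pairs, the shape returned by `get_document_topics`; the values, the film names, and the helper `top_docs_per_topic` are hypothetical.

```python
# Hypothetical per-document topic proportions, one list of (topic, prob) pairs per document
toy_doc_topics = [
    [(0, 0.90), (1, 0.10)],
    [(1, 0.75), (0, 0.25)],
    [(0, 0.60), (1, 0.40)],
]
toy_names = ["Film A", "Film B", "Film C"]

def top_docs_per_topic(doc_topics, num_topics, topn):
    """For each topic, return the `topn` (prob, doc_id) pairs with the highest probability."""
    per_topic = [[] for _ in range(num_topics)]
    for doc_id, pairs in enumerate(doc_topics):
        for topic, prob in pairs:
            per_topic[topic].append((prob, doc_id))
    return [sorted(entries, reverse=True)[:topn] for entries in per_topic]

ranking = top_docs_per_topic(toy_doc_topics, num_topics=2, topn=2)
for topic, entries in enumerate(ranking):
    for prob, doc_id in entries:
        print("%s\t%.3f\t%s" % (topic, prob, toy_names[doc_id]))
```

Because the per-document lists are sparse, a document only competes for the topics it was assigned; in gensim, topics whose probability falls below the model's `minimum_probability` threshold are left out of the list for that document.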



**Optional**: Mallet is topic modeling software that tends to generate better topics than gensim's native implementation (in part because it uses Gibbs sampling rather than gensim's variational inference).  Gensim is compatible with Mallet; to get it working, download [Mallet](http://mallet.cs.umass.edu/download.php) and set the path below to the `mallet` application on your computer.  (In the example below, I've downloaded Mallet to my `Downloads` directory, so change the path to wherever you downloaded it.)  Then execute the following to run Mallet on the same data as above.  Note that the `gensim.models.wrappers` module used here is part of the gensim 3.8.3 release installed above; it was removed in gensim 4.0.

In [15]:
mallet_path="/Users/dbamman/Downloads/mallet-2.0.8/bin/mallet"
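
Before running the wrapper, it can help to confirm that the path really points at an executable file. This is a small optional sketch; `mallet_is_available` and the example path are hypothetical and not part of gensim or Mallet.

```python
import os

def mallet_is_available(path):
    """Return True if `path` is an existing file that the current user can execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)

# Hypothetical path, for illustration only; substitute your own mallet_path
example_path = "/no/such/directory/bin/mallet"
print(mallet_is_available(example_path))
```

If the check returns `False` for your own path, fix the path (or the file's permissions) before running the next cell.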

In [16]:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
lda_mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

In [17]:
for i in range(num_topics):
    print(' '.join([term for term, freq in lda_mallet_model.show_topic(i, topn=10)]))

water find island group boat crew ship lake river plane
tells n't night asks day house takes sees leaves party
kill men killed kills escape gang gun killing shoots escapes
relationship life wife wedding married women husband woman marriage affair
school friends mr. high parents party girls friend boys college
police murder prison case drug death crime evidence detective arrest
car police train truck station drive hotel find road driving
film ends story movie scene time end final people world
time tells finds book asks leaves story find day reveals
house room body finds find door killed dead head window
family father children brother town sister local brothers parents eventually
agent president fbi secret agents smith security cia meeting states
show club band dance music big stage perform york play
war army orders general attack soldiers captain mission american men
money job business company boss pay plan bank deal work
mother father hospital child life daughter baby dr. boy years
whi