In this notebook we'll explore topic modeling to discover broad themes in a collection of movie summaries.  To get started, install pyLDAvis and update numpy and gensim:

```sh
pip install pyLDAvis==2.1.2
pip install numpy --upgrade
pip install gensim --upgrade
```

In [None]:
import nltk
import glob, os, re
import gensim
from gensim import corpora
import operator

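nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models used by nltk.word_tokenize (newer NLTK releases name this resource 'punkt_tab')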
from nltk.corpus import stopwords

import pyLDAvis
import pyLDAvis.gensim

from sklearn import preprocessing
from sklearn import linear_model
import numpy as np

In [None]:
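def read_stopwords(filename):
    """ Read a stoplist file (one word per line) into a dict keyed by word """
    # local name chosen to avoid shadowing the nltk `stopwords` import
    words={}
    with open(filename) as file:
        for line in file:
            words[line.rstrip()]=1
    return words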

Since we're running topic modeling on texts with lots of names, we'll add the Jockers list of stopwords (which includes character names) to our stoplist.

In [None]:
# Use a set so that membership tests in filter() are constant-time
stop_words = set(stopwords.words('english'))
stop_words.update(read_stopwords("../data/jockers.stopwords"))
stop_words.add("'s")

In [None]:
def filter(word, stopwords):
    
    """ Function to exclude words from a text """
    
    # no stopwords
    if word in stopwords:
        return False
    
    # has to contain at least one letter
    if re.search("[A-Za-z]", word) is not None:
        return True
    
    return False
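
As a quick sanity check of the two rules above, here is a minimal standalone sketch. The name `keep_token`, the toy stoplist, and the toy tokens are hypothetical and used only for illustration; the logic is meant to mirror `filter`.

```python
import re

def keep_token(word, stoplist):
    """Return True if `word` is not a stopword and contains at least one ASCII letter."""
    if word in stoplist:
        return False
    return re.search("[A-Za-z]", word) is not None

# Hypothetical toy stoplist and tokens, for illustration only
toy_stoplist = {"the", "'s", "harry"}
toy_tokens = ["the", "wizard", "'s", "1984", "harry", "r2-d2", "..."]

# Stopwords, pure numbers, and pure punctuation are dropped
kept = [tok for tok in toy_tokens if keep_token(tok, toy_stoplist)]
print(kept)
```

Note that a token mixing letters and digits, such as `r2-d2`, passes the letter check, while `1984` and `...` do not.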

In [None]:
def read_docs(inputDir, stopwords):
    """ Read in movie documents (all ending in .txt) from an input folder"""
    
    docs=[]
    names=[]
    for idx, filename in enumerate(glob.glob(os.path.join(inputDir, '*.txt'))):
        # cap the corpus at the first 100 files returned by glob
        if idx >= 100:
            break
        with open(filename) as file:
            tokens=nltk.word_tokenize(file.read().lower())
            tokens=[x for x in tokens if filter(x, stopwords)]
            docs.append(tokens)
            basename=os.path.basename(filename)
            name, file_extension = os.path.splitext(basename)
            names.append(name)
    return docs, names

In [None]:
text_dir="../data/movie_summaries"
data, doc_names=read_docs(text_dir, stop_words)

We will convert the movie summaries into a Bag-of-Words representation using gensim's [corpora.dictionary](https://radimrehurek.com/gensim/corpora/dictionary.html) methods.
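
To make the Bag-of-Words idea concrete, here is a minimal pure-Python sketch. It is not gensim's implementation: the toy documents, `token2id`, and `to_bow` are hypothetical names for illustration, and the integer ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

# Hypothetical toy documents, already tokenized
toy_docs = [
    ["detective", "murder", "detective", "city"],
    ["alien", "ship", "city", "alien", "alien"],
]

# Assign each distinct token an integer id, in order of first appearance
token2id = {}
for doc in toy_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens outside the vocabulary are dropped."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_bows = [to_bow(doc, token2id) for doc in toy_docs]
for bow in toy_bows:
    print(bow)
```

Each document becomes a list of `(token_id, count)` pairs, so word order is discarded and only frequencies remain.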

In [None]:
# Create vocab from data; restrict vocab to only the top 10K terms that show up in at least 5 documents 
# and no more than 50% of all documents

dictionary = corpora.Dictionary(data)
dictionary.filter_extremes(no_below=5, no_above=.5, keep_n=10000)
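
The effect of the three thresholds can be sketched in plain Python. This is an illustration under stated assumptions rather than gensim's code: `filter_vocab` and the toy documents are hypothetical, and tokens are ranked by document frequency with alphabetical tie-breaking.

```python
from collections import Counter

def filter_vocab(docs, no_below, no_above, keep_n):
    """Keep tokens that occur in at least `no_below` documents and in at most
    `no_above` (a fraction) of all documents, then keep at most `keep_n` of
    them, highest document frequency first."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))  # ties broken alphabetically
    return kept[:keep_n]

# Hypothetical toy corpus: "love" is in every document, "ship" in only one
toy_docs = [
    ["love", "war", "city"],
    ["love", "war", "ship"],
    ["love", "city", "alien"],
    ["love", "war", "alien"],
]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.75, keep_n=2)
print(vocab)
```

Here `love` is dropped for being too common, `ship` for being too rare, and `keep_n=2` then trims the three survivors to the two with the highest document frequency.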

In [None]:
# Replace each document with (token id, count) pairs for the words in the vocab (all other words are excluded)
corpus = [dictionary.doc2bow(text) for text in data]

In [None]:
num_topics=20

First, let's try using gensim's built-in LDA.

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=num_topics, 
                                           passes=10,
                                           alpha='auto')

We can get a sense of what the topics are by printing, for each topic, the 10 words with the highest $P(word \mid topic)$.

In [None]:
for i in range(num_topics):
    print(' '.join([term for term, freq in lda_model.show_topic(i, topn=10)]))
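
The loop above relies on each topic being a probability distribution over the vocabulary. As a minimal sketch of the ranking step, with made-up probabilities and a hypothetical helper `top_words` (neither comes from the trained model):

```python
# Hypothetical P(word | topic) values for a single topic
topic_word_probs = {
    "ship": 0.30, "alien": 0.25, "planet": 0.20,
    "crew": 0.15, "love": 0.07, "city": 0.03,
}

def top_words(word_probs, topn):
    """Return the `topn` (word, probability) pairs with the highest probability."""
    return sorted(word_probs.items(), key=lambda pair: pair[1], reverse=True)[:topn]

top3 = top_words(topic_word_probs, topn=3)
print(' '.join(word for word, prob in top3))
```

Reading only the top-ranked words is a summary: the low-probability tail still belongs to the topic but rarely changes how it is interpreted.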

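MALLET is a Java package that tends to yield better results (in terms of topic coherence).  Download it from [here](http://mallet.cs.umass.edu/download.php) and point the `mallet_path` line below to its location on your computer.  Note that the `gensim.models.wrappers` module used here was removed in gensim 4.0, so these cells need a gensim 3.x release.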

In [None]:
mallet_path="/Users/mashabelyi/Downloads/mallet-2.0.8/bin/mallet"

In [None]:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
lda_mallet_model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)

In [None]:
for i in range(num_topics):
    print(' '.join([term for term, freq in lda_mallet_model.show_topic(i, topn=10)]))
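
To compare the two sets of topics on more than intuition, one option is a coherence score. The sketch below is a simplified pure-Python version of the UMass measure, written under the assumption that every top word occurs in at least one document; `umass_coherence` and the toy data are hypothetical. In practice gensim's `CoherenceModel` would be the more likely tool.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass-style coherence: for every pair of top words, add
    log((docs containing both + 1) / docs containing the higher-ranked word).
    Assumes each word in `top_words` occurs in at least one document."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        # `earlier` is ranked above `later` in the topic's word list
        score += math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
    return score

# Hypothetical toy corpus and two candidate topics
toy_docs = [
    ["ship", "alien", "planet"],
    ["ship", "alien"],
    ["love", "city"],
    ["love", "ship"],
]
coherent_score = umass_coherence(["ship", "alien"], toy_docs)
mixed_score = umass_coherence(["ship", "city"], toy_docs)
print(coherent_score, mixed_score)
```

Scores closer to zero indicate top words that tend to co-occur in the same documents, so the first toy topic scores higher than the second.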

Another way of understanding topics is to print the documents in which each topic is most prominent, i.e., for a given topic $k$, the documents with the highest $P(topic=k \mid document)$.  How well do the documents listed here align with your understanding of the topics?

In [None]:
topic_model=lda_mallet_model 

# topic_docs[k] maps doc_id -> P(topic=k | document)
topic_docs=[{} for _ in range(num_topics)]
for doc_id in range(len(corpus)):
    # topics whose probability falls below the model's minimum threshold are omitted
    doc_topics=topic_model.get_document_topics(corpus[doc_id])
    for topic_num, topic_prob in doc_topics:
        topic_docs[topic_num][doc_id]=topic_prob

for i in range(num_topics):
    print("%s\n" % ' '.join([term for term, freq in topic_model.show_topic(i, topn=10)]))
    sorted_x = sorted(topic_docs[i].items(), key=operator.itemgetter(1), reverse=True)
    for k, v in sorted_x[:5]:
        print("%s\t%.3f\t%s" % (i,v,doc_names[k]))
    print()
    
    

Let's also explore topics using pyLDAvis, a visualization library.

In [None]:
pyLDAvis.enable_notebook()

In [None]:
vis = pyLDAvis.gensim.prepare(topic_model, corpus, dictionary, lambda_step=.5)

In [None]:
vis

Q1: Adapt the code above to operate on the documents you used for classification earlier in this course.  Then execute the following cell with your trained `lda_mallet_model` and your `num_topics` so we can explore the topics that emerged.

In [None]:
for i in range(num_topics):
    print(' '.join([term for term, freq in lda_mallet_model.show_topic(i, topn=10)]))