# Topical Discovery and Latent Dirichlet Allocation - Example 03

In this example we will use the transcript from the Obama Romney presidential debate in 2012. Our goal is:

1. Identify the overall topic of discussion
2. Identify topics brought forward by President Obama
3. Identify topics brought forward by the presidential candidate Romney

When working with LDA you want your text to be as clean as possible so to generate more meaningful results.

In [1]:
import re
import nltk
import gensim
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.collocations import *
from nltk.corpus import stopwords
from gensim import corpora
import pyLDAvis.gensim

In [3]:
def remove_utf(text):
    return re.sub(r'[^\x00-\x7f]',r' ',text)

def load_stopwords():
    swords=[]
    path="./data/stopwords.txt"
    file_input = open (path,"r")
    lines = file_input.readlines()
    for line in lines:
        swords.append(line[:-1])
    file_input.close()
    return swords

def loadCorpus(path):
    data = []
    file_input = open (path,"r")
    lines = file_input.readlines()
    for line in lines:
        data.append(remove_utf(line[:-1].lower()))
    file_input.close()
    return data

stopwords = load_stopwords()
path = "./data/Obama-Romney-Debate.txt"
debate = loadCorpus(path)
print (debate)

['schieffer: good evening from the campus of lynn university here in boca raton, florida. this is the fourth and last debate of the 2012 campaign, brought to you by the commission on presidential debates.', '', "this one's on foreign policy. i'm bob schieffer of cbs news. the questions are mine, and i have not shared them with the candidates or their aides.", '', 'schieffer: the audience has taken a vow of silence -- no applause, no reaction of any kind, except right now when we welcome president barack obama and governor mitt romney.', '', ' ', '', "gentlemen, your campaigns have agreed to certain rules and they are simple. they've asked me to divide the evening into segments. i'll pose a question at the beginning of each segment. you will each have two minutes to respond and then we will have a general discussion until we move to the next segment.", '', "tonight's debate, as both of you know, comes on the 50th anniversary of the night that president kennedy told the world that the so

In [4]:
print (stopwords)

['a', 'able', 'about', 'above', 'across', 'actually', 'after', 'again', 'against', 'all']


In [5]:
discourse = {'schieffer:':[],'romney:':[],"obama:":[]}
keys = discourse.keys()

current = ""
for line in debate:
    if len(line)>5:
        for key in keys:
            if line.startswith(key):
                current = key
        l = discourse[current]
        l.append(line)
        discourse[current]=l

In [6]:
discourse["obama:"]

["obama: well, my first job as commander in chief, bob, is to keep the american people safe. and that's what we've done over the last four years.",
 "we ended the war in iraq, refocused our attention on those who actually killed us on 9/11. and as a consequence, al qaeda's core leadership has been decimated.",
 "in addition, we're now able to transition out of afghanistan in a responsible way, making sure that afghans take responsibility for their own security. and that allows us also to rebuild alliances and make friends around the world to combat future threats. now with respect to libya, as i indicated in the last debate, when we received that phone call, i immediately made sure that, number one, that we did everything we could to secure those americans who were still in harm's way; number two, that we would investigate exactly what happened, and number three, most importantly, that we would go after those who killed americans and we would bring them to justice. and that's exactly w

<h1>Pre-Processing</h1>
<h2>Tokenization and Collocations</h2>
For our tokenization task, let's use the nltk WordTokenizer ...

In [8]:
def remove_punctuation(corpus):
    punctuations = r".,\"-\\/#!?$%\^&\*;:{}=\-_'~()"    
    filtered_corpus = [token for token in corpus if (not token in punctuations)]
    return filtered_corpus

def apply_stopwording(corpus, min_len):
    black_list = ['schieffer','obama','romney']
    filtered_corpus = [token for token in corpus if (not token in stopwords and not token in black_list and len(token)>min_len)]
    return filtered_corpus

def apply_lemmatization(corpus):
    lemmatizer = nltk.WordNetLemmatizer()
    normalized_corpus = [lemmatizer.lemmatize(token) for token in corpus]
    return normalized_corpus

def getCollocations(text, min_freq, coll_num):
    bigrams = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(text)
    finder.apply_freq_filter(min_freq)
    collocations = finder.nbest(bigrams.pmi, coll_num)
    return collocations

def replaceCollocationsInText(text,collocations):
    first = [t[0]for t in collocations]
    second = [t[1] for t in collocations]

    dtokens = []
    i = 0
    while i<=(len(text)-1):
        try:
            idx1 = first.index(text[i])
            if (text[i+1]==second[idx1]):
                dtokens.append(first[idx1]+"_"+second[idx1])
                i=i+1
        except:
            dtokens.append(text[i])
            pass
        i=i+1
    return dtokens

def processCorpus(corpus_data):
    #The input is an array of unprocessed text documents
    min_frequency = 3
    num_of_collocations=100
    text=""
    corpus=[]
    tokens =[]
    
    #Extract corpus and preprocess data
    for line in corpus_data:
        t = nltk.word_tokenize(line)
        doc = nltk.Text(t)
        doc_clean = nltk.Text(apply_lemmatization(apply_stopwording(remove_punctuation(doc), 3)))
        corpus.append(doc_clean)
        tokens.extend(doc_clean.tokens)
        text=text+line
    
    #Identify collocations
    collocations = getCollocations(tokens,min_frequency,num_of_collocations)
    docs = []
    for doc in corpus:
        t = replaceCollocationsInText(doc,collocations)
        if (len(t)>0):
            docs.append(replaceCollocationsInText(doc,collocations))
    return docs

In [9]:
docs = processCorpus(debate)
print (len(docs))
print (docs[0:10])

405
[['evening', 'campus', 'lynn_university', 'boca', 'raton', 'florida', 'fourth', 'debate', '2012', 'campaign', 'brought', 'commission', 'presidential', 'debate'], ['foreign_policy', 'news', 'mine', 'shared', 'candidate', 'aide'], ['audience', 'silence', 'applause', 'reaction', 'except', 'welcome', 'barack', 'governor', 'mitt'], ['gentleman', 'campaign', 'agreed', 'rule', 'simple', 'divide', 'evening', 'segment', 'pose', 'question_segment', 'minute', 'discussion', 'move', 'segment'], ['tonight', 'debate', 'come', '50th', 'anniversary', 'night', 'kennedy', 'told', 'soviet', 'union', 'installed', 'missile', 'cuba', 'closest', 'sobering', 'reminder', 'unexpected', 'threat_national', 'abroad'], ['segment', 'challenge', 'changing', 'middle_east', 'terrorism', 'segment', 'topic', 'question_segment', 'subject', 'concern', 'libya', 'controversy', 'continues', 'dead', 'including', 'ambassador', 'remain', 'caused', 'spontaneous', 'intelligence', 'failure', 'policy', 'failure', 'attempt', 'misl

<h1>Topic Modeling with LDA</h1>
<h2>Creating a dictionary from the corpus</h2>

We will create a bag-of-words representation of the dictionary
You will get a warning which can be ignored ...

In [2]:
k=5
iterations = 40

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
topic_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word = dictionary, passes = iterations)
lda_vis = pyLDAvis.gensim.prepare(topic_model,corpus,dictionary,sort_topics=False)
pyLDAvis.display(lda_vis)

NameError: name 'docs' is not defined

## Obama related Topics
What were the topic referred by Obama?

In [23]:
k=4
docs = processCorpus(discourse["obama:"])
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
topic_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word = dictionary, passes = iterations)
lda_vis = pyLDAvis.gensim.prepare(topic_model,corpus,dictionary,sort_topics=False)
pyLDAvis.display(lda_vis)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]


## Romney related Topics
What were the topic referred by Obama?

In [22]:
k=5
docs = processCorpus(discourse["romney:"])
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
topic_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=k, id2word = dictionary, passes = iterations)
lda_vis = pyLDAvis.gensim.prepare(topic_model,corpus,dictionary,sort_topics=False)
pyLDAvis.display(lda_vis)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]
