# Topic Modeling / LDA

Based on the tutorial found at

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

Note: The Oathkeeper data set is ~250,000 lines. Reading it takes a couple of minutes, so be patient. Processing the whole data set takes much longer, so this example uses only a small subset of the data to test the analysis flow.


In [1]:
import pandas as pd

In [2]:
# prepare cleanup function
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

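def clean(doc):
    # lowercase, split on whitespace, and drop English stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # delete every punctuation character (nothing is inserted in its place)
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # reduce each remaining word to its WordNet lemma
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized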
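
To see what the cleanup steps do to a piece of text, here is a minimal sketch that mirrors `clean` using only the standard library. The stop word set `TOY_STOPWORDS` is a hypothetical stand-in for NLTK's English list, and lemmatization is left out, so this illustrates the idea rather than reproducing the function above. Deleting punctuation outright glues hyphenated words together, which may be where tokens such as `purgedmilitaryhighofficers` in the word counts further down come from. The second function is an assumed variant that replaces punctuation with spaces instead.

```python
import string

# Hypothetical stand-in for NLTK's English stop word list.
TOY_STOPWORDS = {"the", "of", "a", "is", "in"}
PUNCTUATION = set(string.punctuation)

def clean_sketch(doc):
    """Mirror clean(): drop stop words, then delete punctuation (no lemmatizer)."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in TOY_STOPWORDS)
    return "".join(ch for ch in stop_free if ch not in PUNCTUATION)

# Translation table that maps every punctuation character to a space.
TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_sketch_spaced(doc):
    """Assumed variant: turn punctuation into spaces so word parts stay separate."""
    stop_free = " ".join(w for w in doc.lower().split() if w not in TOY_STOPWORDS)
    return " ".join(stop_free.translate(TO_SPACE).split())

sample = "The purge of high-ranking officers, in 2013!"
glued = clean_sketch(sample)
spaced = clean_sketch_spaced(sample)
print(glued)   # purge highranking officers 2013
print(spaced)  # purge high ranking officers 2013
```

Whether the spaced variant is preferable depends on the data: it keeps word parts readable, but it also splits URLs and contractions into fragments.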

In [3]:
# read data
df = pd.read_pickle("nationalOathKeepers")

In [4]:
# remove leading and trailing whitespace in post_forum
df['post_forum'] = df['post_forum'].str.strip()

In [10]:
print("Size of original dataframe: ", df.shape)

Size of original dataframe:  (257341, 7)


In [7]:
# Group dataframe by 'thread_name' to merge all posts of a thread into one 'text sample'
df2 = df.groupby('thread_name').sum()

In [8]:
# print size of grouped dataframe
print("Size of grouped-by dataframe: ", df2.shape)

Size of grouped-by dataframe:  (23528, 5)
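
Summing a string column amounts to `+` concatenation, so consecutive posts are joined with nothing between them and the last word of one post can run into the first word of the next. A minimal sketch of an alternative, assuming a toy dataframe with the same `thread_name` and `post_content` column names, joins the posts with an explicit space:

```python
import pandas as pd

# Toy stand-in for the forum data; the column names follow the notebook.
toy = pd.DataFrame({
    "thread_name": ["a", "a", "b"],
    "post_content": ["first post", "second post", "only post"],
})

# Join all posts of a thread with a single space between them.
merged = toy.groupby("thread_name")["post_content"].agg(" ".join)
print(merged["a"])  # first post second post
```

This sketch only merges the text column; the other columns of the real dataframe would need their own aggregation if they are used later.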


In [11]:
# pick only the first nrows rows from this dataframe for testing;
# .copy() gives an independent dataframe, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
nrows = 10
df2sample = df2[:nrows].copy()

In [12]:
# clean up: each thread becomes a list of cleaned tokens
#docs = df.post_content
df2sample['post_clean'] = [clean(doc).split() for doc in df2sample.post_content]


In [13]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(df2sample['post_clean'])

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in df2sample['post_clean']]
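
For intuition, the document-term matrix is a list with one entry per document, each entry being a list of `(token_id, count)` pairs. The sketch below rebuilds that structure with the standard library on two toy documents. It is not gensim's implementation: the ids here are assigned in first-seen order, which is an assumption of this sketch and may differ from the ids gensim assigns.

```python
from collections import Counter

# Two toy tokenized documents (hypothetical data).
docs = [["bank", "world", "bank"], ["world", "government"]]

# Assign each unique token an integer id in first-seen order.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)  # {'bank': 0, 'world': 1, 'government': 2}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only how often each term occurs in a document is kept, which is all that LDA uses.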

In [14]:
# Do Topic Modeling
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=5, id2word = dictionary, passes=50)

In [15]:
# print output
ldamodel.print_topics(num_topics=5, num_words=4)

[(0, '0.018*"job" + 0.015*"work" + 0.011*"get" + 0.010*"back"'),
 (1, '0.010*"boob" + 0.009*"people" + 0.008*"state" + 0.008*"insurance"'),
 (2, '0.012*"true" + 0.012*"felony" + 0.010*"first" + 0.010*"bill"'),
 (3, '0.015*"reply" + 0.014*"re" + 0.013*"camp" + 0.010*"fema"'),
 (4, '0.011*"bank" + 0.011*"world" + 0.010*"government" + 0.009*"foreign"')]
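
Each entry pairs a topic id with its highest-weighted words, and the number in front of each word is that word's weight within the topic. If the weights are needed as numbers, the printed strings can be parsed. A minimal sketch, using the topic 0 string from the output above and a regular expression that assumes the `weight*"word"` layout shown there:

```python
import re

# Topic 0 as printed by print_topics above.
topic = '0.018*"job" + 0.015*"work" + 0.011*"get" + 0.010*"back"'

# Capture each weight and the quoted word that follows it.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic)]
print(pairs)  # [('job', 0.018), ('work', 0.015), ('get', 0.011), ('back', 0.01)]
```

When the trained model is still in memory, gensim's `show_topic` method should give word and weight pairs directly, without parsing strings.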

In [20]:
# Print top words in raw input data
import collections

def count_words( wlist ):
    ncount = 5
    dcount = collections.Counter(wlist).most_common(ncount)
    print(dcount)

In [21]:
count_words(df2sample['post_clean'][0])

[('purgedmilitaryhighofficers', 2), ('reply', 2), ('preserve', 2), ('time', 2), ('837', 1)]


In [26]:
# keep only threads with more than 50 tokens; .str.len must be called
# (with parentheses) to get the per-row list lengths
df2sample2 = df2sample[ df2sample['post_clean'].str.len() > 50 ]
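
Filtering on the number of tokens per thread needs `.str.len()` to be called with parentheses; comparing the bare method `.str.len` to an integer raises a `TypeError`. A minimal sketch on toy data, where the threshold `min_tokens` is a hypothetical value chosen to suit the toy rows and `thread_name` is a plain column rather than the index:

```python
import pandas as pd

# Toy dataframe with one list of tokens per thread (hypothetical data).
toy = pd.DataFrame({
    "thread_name": ["a", "b"],
    "post_clean": [["short", "thread"], ["a", "much", "longer", "thread"]],
})

min_tokens = 2  # hypothetical threshold; the notebook uses 50

# .str.len() returns the length of each list; the mask keeps the longer rows.
long_threads = toy[toy["post_clean"].str.len() > min_tokens].copy()
print(long_threads["thread_name"].tolist())  # ['b']
```

The `.copy()` keeps the filtered result independent of the original dataframe, so columns can be added to it later without a `SettingWithCopyWarning`.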