# Topic Modeling / LDA

Based on the tutorial found at

https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

Note: The Oathkeeper data set has more than 250,000 rows. Reading it takes a couple of minutes, so be patient. Processing the whole data set takes much longer, so this example uses only a subset of the data file to test the analysis flow.


In [1]:
import pandas as pd

In [2]:
# prepare cleanup function
# (needs the NLTK 'stopwords' and 'wordnet' data, e.g. via nltk.download)
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

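def clean(doc):
    # lowercase the text and drop English stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # lemmatize each remaining word (e.g. "keepers" -> "keeper")
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized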

In [3]:
# Prepare function to print top words in cleaned data
import collections

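def count_words(row, ncount=5):
    # ncount: number of top words to report per thread (5, as in the output below)
    wlist = row['post_clean']
    dcount = collections.Counter(wlist).most_common(ncount)

    # row.index.name is None for a row Series, hence the "None" in the output;
    # row.name would hold the thread name instead
    print("Word count in post: ", row.index.name)
    print(dcount)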

### Read input data and do some clean-up

In [4]:
# read data
df = pd.read_pickle("nationalOathKeepers")

In [5]:
# remove trailing whitespace in post_forum
df['post_forum'] = df['post_forum'].str.strip()

In [6]:
print("Size of original dataframe: ", df.shape)

Size of original dataframe:  (257341, 7)


In [7]:
# Group dataframe by 'thread_name' to merge discussion in single thread into one 'text sample'
df2 = df.groupby('thread_name').sum()
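
One thing to be aware of: `sum()` on string columns concatenates them without a separator, so the last word of one post can fuse with the first word of the next. A minimal sketch of an alternative, assuming the post text lives in `post_content` (the toy `posts` frame below is made up for illustration), joins the posts of a thread with a space instead:

```python
import pandas as pd

# Toy stand-in for the forum data (hypothetical values)
posts = pd.DataFrame({
    "thread_name": ["a", "a", "b"],
    "post_content": ["first post", "second post", "only post"],
})

# Join all posts of a thread with a space so word boundaries survive
merged = posts.groupby("thread_name")["post_content"].agg(" ".join)
print(merged["a"])  # first post second post
```

This only merges the text column; other columns would need their own aggregation if they are used later.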

In [8]:
# print size of grouped dataframe
print("Size of grouped-by dataframe: ", df2.shape)

Size of grouped-by dataframe:  (23528, 5)


In [9]:
# Add a column measuring thread length, so that threads that are too short
# can be excluded from topic analysis.
# Note: len(x) on the post_content string counts characters, not words.
df2['word_count'] = [ len(x) for x in df2['post_content'] ]
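
Because each entry of `post_content` is a single string, `len(x)` returns its character count. If a true word count is wanted, a sketch like the following (with a made-up sample string) splits on whitespace first:

```python
text = "oath keepers forum post"  # hypothetical sample

n_chars = len(text)          # counts characters, including spaces
n_words = len(text.split())  # counts whitespace-separated tokens

print(n_chars, n_words)  # 23 4
```

In the notebook this would correspond to `len(x.split())` in the list comprehension, with the length threshold adjusted accordingly.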

In [10]:
# select only rows whose content exceeds a minimum length (here: 100,000 characters)
df2sample = df2[ df2['word_count'] > 100000 ].copy()

In [12]:
# select only subset of rows (speeding things up for testing purpose)
nrows = 10
df2sample = df2sample[:nrows].copy()

In [13]:
# clean up
df2sample['post_clean'] = [clean(doc).split() for doc in df2sample['post_content'] ]

In [None]:
# clean up part 2: remove the most frequent words, remaining stop words, etc.
# TODO: implement this

### Word frequency analysis

A simple frequency count of the words in each thread of the dataframe.

Print a ranked list of the most frequent words occurring in each thread from the cleaned-up data subset (each row corresponds to a single thread with all posts from that thread grouped together):

In [20]:
# Print top words in cleaned data
df2sample.apply(count_words, axis=1)
print("")

Word count in post:  None
[('gun', 170), ('mental', 151), ('health', 133), ('control', 105), ('article', 97)]
Word count in post:  None
[('view', 747), ('police', 437), ('post', 434), ('thread', 388), ('forum', 321)]
Word count in post:  None
[('view', 299), ('post', 158), ('target', 137), ('terrorist', 108), ('please', 107)]
Word count in post:  None
[('view', 3212), ('oath', 3071), ('keeper', 2850), ('post', 1736), ('coin', 1386)]
Word count in post:  None
[('bus', 247), ('harrisburg', 246), ('pa', 231), ('reply', 204), ('pm', 197)]
Word count in post:  None
[('view', 384), ('9mm', 208), ('post', 197), ('found', 138), ('levoy', 129)]
Word count in post:  None
[('muslim', 227), ('friend', 160), ('battle', 150), ('disengaging', 144), ('reply', 144)]
Word count in post:  None
[('dc', 334), ('may', 331), ('march', 305), ('reply', 274), ('re', 259)]
Word count in post:  None
[('benghazi', 10415), ('attack', 7411), ('u', 4951), ('libya', 4540), ('state', 4394)]
Word count in post:  None
[(

The most frequent words in each thread appear to be characteristic of that thread's topic. The lists still contain forum- or language-specific words (such as 'view', 'post', 'reply', 'pm') that should be removed in the clean-up step above.
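
A minimal sketch of such a filtering step, in plain Python, is shown here. The helper `remove_common_words`, the `FORUM_WORDS` list, and the `max_doc_fraction` threshold are assumptions for illustration, not part of the original notebook. The idea is to drop a hand-picked list of forum words plus any word that occurs in most of the threads:

```python
from collections import Counter

# Hypothetical list of forum boilerplate words; extend after inspecting the output
FORUM_WORDS = {"view", "post", "reply", "re", "pm", "thread", "forum"}

def remove_common_words(docs, extra_stop=FORUM_WORDS, max_doc_fraction=0.8):
    """Drop words in extra_stop and words present in more than
    max_doc_fraction of the documents. docs is a list of token lists."""
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in docs for word in set(doc))
    limit = max_doc_fraction * len(docs)
    too_common = {w for w, n in doc_freq.items() if n > limit}
    drop = set(extra_stop) | too_common
    return [[w for w in doc if w not in drop] for doc in docs]

# Toy token lists standing in for the 'post_clean' column
docs = [
    ["view", "gun", "control", "people"],
    ["reply", "benghazi", "attack", "people"],
    ["post", "coin", "silver", "people"],
]
filtered = remove_common_words(docs)
print(filtered)
```

In the notebook, the function could be applied to `list(df2sample['post_clean'])` before the frequency count and the LDA step. With only ten threads, the document-frequency threshold is coarse and would need some experimentation.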

### Applying LDA Topic Modeling algorithm

Using the gensim LDA implementation to try topic modeling. Parameters such as the number of topics are not tuned yet; this just runs an example as a 'black box'.

In [16]:
# Importing Gensim as preparation for Topic Modeling
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(df2sample['post_clean'])

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in df2sample['post_clean']]

In [17]:
# Do Topic Modeling
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=5, id2word = dictionary, passes=50)

In [19]:
# print output
ldamodel.print_topics(num_topics=5, num_words=10)

[(0,
  '0.038*"view" + 0.026*"oath" + 0.024*"keeper" + 0.021*"post" + 0.013*"re" + 0.011*"coin" + 0.011*"forum" + 0.010*"join" + 0.010*"silver" + 0.010*"date"'),
 (1,
  '0.009*"reply" + 0.008*"re" + 0.008*"may" + 0.008*"dc" + 0.007*"march" + 0.007*"oath" + 0.006*"people" + 0.005*"keeper" + 0.005*"one" + 0.005*"it"'),
 (2,
  '0.009*"reply" + 0.008*"re" + 0.006*"state" + 0.005*"one" + 0.005*"muslim" + 0.005*"bus" + 0.005*"harrisburg" + 0.005*"people" + 0.005*"would" + 0.005*"pa"'),
 (3,
  '0.016*"benghazi" + 0.012*"attack" + 0.008*"u" + 0.007*"libya" + 0.007*"state" + 0.006*"said" + 0.005*"security" + 0.005*"one" + 0.005*"american" + 0.004*"obama"'),
 (4,
  '0.000*"benghazi" + 0.000*"attack" + 0.000*"libya" + 0.000*"state" + 0.000*"u" + 0.000*"said" + 0.000*"re" + 0.000*"obama" + 0.000*"american" + 0.000*"reply"')]

As found in the frequency analysis before, there are still words that should be removed from the text samples because they are language or forum specific ('re', 'reply', ...) but are not related to the actual content or topic.

The topics and associated words listed above don't look like a useful set of topics: topic 4 assigns a weight of 0.000 to every listed word, and topics 1 and 2 share several words. How can we better estimate the number of topics to be found in the text samples? Does using more text samples help? Are there any other parameters we can tune for the LDA algorithm?