# LDA Mallet Model notebook: Defamation in the US 2016 Presidential Election

Here we implement the LDA Mallet model and evaluate its performance.

Link to reference: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#16buildingldamalletmodel

Before settling on an approach for detecting defamation in tweets, we used topic modeling to understand the key topics the tweets discuss. Topic models are a ubiquitous tool in text analysis because they can handle massive quantities of unlabeled text. Here, a topic refers to a group of words that have a high probability of appearing together. Running our corpus through a topic model let us preview the areas in which our users may have tried to exert influence through defamation attempts.

Although there are many variants of topic models in the literature, we used the MALLET (MAchine Learning for LanguagE Toolkit) (\cite{McCallumMALLET}) model for our analysis. MALLET is well suited to topic modeling as an initial exploratory step because it provides a scalable implementation of Gibbs sampling, which speeds up clustering. We first ran the model to search efficiently for the optimal number of topics in our corpus. The search range spanned from 5 to 35 topics with a step size of 3. The optimal number of topics was 26, with the highest coherence value of 39.32. The primary results of the model are shown in Figure \ref{fig:num_tweets_topic}, which lists the 26 labeled topics along the y-axis and the number of tweets pertaining to each topic along the x-axis. Given the context of the US elections, and American current events in general, it is reasonable that the main topics include Trump, Clinton, Barack Obama, terrorism, economics, and gun control. Overall, the majority of the identified topics are sensitive in the US and are clear targets for attempts at influencing opinions. Furthermore, Figure \ref{fig:topic_freq_date} shows, for each topic, how tweet frequency changes by date between 2014 and 2017. What stands out is the peak in tweets around Summer 2015, which coincides with the start of the Trump campaign.
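The search grid described above can be written out explicitly. The sketch below assumes the range includes 35; Python's `range` excludes its stop value, so the stop is set to 36. The name `candidate_ks` is illustrative only.

```python
# Candidate topic counts: 5 to 35 inclusive with a step of 3.
# range() excludes its stop value, hence 36 rather than 35.
candidate_ks = list(range(5, 36, 3))

print(candidate_ks)
print(26 in candidate_ks)  # check that the reported optimum lies on the grid
```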

# Import packages

In [2]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt


# Load files

In [15]:
import os
os.getcwd()

'/Users/kristin.lomicka/Documents/BGSE_Courses/text_mining/Project'

In [27]:
# Import pickle files
import pickle

# Tweets
with open('clean_tweets', 'rb') as outputs:
    tweets = pickle.load(outputs)

# Model results

with open('ldamallet_17', 'rb') as outputs:
    optimal_model = pickle.load(outputs)
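
The notebook does not show how `clean_tweets` and `ldamallet_17` were written. A minimal sketch of the matching save step, assuming they were produced with `pickle.dump`; the file name `example_tweets.pkl` and the toy data are placeholders.

```python
import os
import pickle
import tempfile

# Toy stand-in for the tokenized tweets (a list of token lists).
example_tweets = [["new", "convoy", "arrived"], ["civilians", "killed", "mosul"]]

# Write to a temporary directory so the sketch does not touch project files.
path = os.path.join(tempfile.mkdtemp(), "example_tweets.pkl")
with open(path, "wb") as f:
    pickle.dump(example_tweets, f)

# Load it back the same way the notebook loads 'clean_tweets'.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == example_tweets)
```

Pickled model objects are generally only safe to load with the same library versions that wrote them, so it may be worth recording those versions next to the file.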

In [16]:
# Load mallet file

# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'mallet-2.0.8/bin/mallet' # update this path

# Preprocess

In [5]:
# Create dictionary and corpus needed for topic model

# Create Dictionary
id2word = corpora.Dictionary(tweets)

# Create Corpus
texts = tweets

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1)]]
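
Each pair in the output is `(token_id, count)`. The pure-Python sketch below mimics that bag-of-words encoding without Gensim. `build_vocab` and `to_bow` are hypothetical helpers, and the ids they assign need not match those of `corpora.Dictionary`.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["new", "convoy", "arrived"], ["civilians", "killed", "mosul"]]
vocab = build_vocab(docs)
print([to_bow(docs[0], vocab)])
```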


# Run LDA Mallet Model

In [20]:
# Run LDA for a single k (# of topics)
# Note: gensim.models.wrappers was removed in Gensim 4.0, so this call requires gensim < 4.0
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=17, id2word=id2word)

In [29]:
# Show Topics
#optimal_model = ldamallet
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=optimal_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

[(0,
  '0.145*"rt" + 0.031*"vote" + 0.019*"cruz" + 0.019*"support" + 0.018*"https" '
  '+ 0.015*"democrats" + 0.014*"poll" + 0.012*"ted" + 0.012*"retweet" + '
  '0.011*"care"'),
 (1,
  '0.165*"rt" + 0.097*"obama" + 0.018*"https" + 0.015*"russia" + '
  '0.013*"obama\'s" + 0.013*"party" + 0.011*"wall" + 0.009*"jobs" + '
  '0.009*"dnc" + 0.009*"open"'),
 (2,
  '0.110*"rt" + 0.033*"time" + 0.022*"years" + 0.013*"money" + 0.013*"million" '
  '+ 0.012*"states" + 0.010*"change" + 0.010*"united" + 0.010*"days" + '
  '0.010*"long"'),
 (3,
  '0.379*"rt" + 0.158*"islam" + 0.122*"lesson" + 0.122*"today\'s" + 0.087*"️" '
  '+ 0.016*"️️" + 0.008*"religion" + 0.007*"peace" + 0.004*"\u200d" + '
  '0.002*"️️️"'),
 (4,
  '0.051*"muslim" + 0.034*"muslims" + 0.025*"islamic" + 0.021*"isis" + '
  '0.017*"terrorist" + 0.014*"terror" + 0.013*"refugees" + 0.013*"terrorists" '
  '+ 0.012*"attacks" + 0.011*"terrorism"'),
 (5,
  '0.193*"rt" + 0.033*"video" + 0.021*"watch" + 0.020*"\'s" + 0.017*"https" + '
  '0.01
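
The topics above are dominated by Twitter artifacts such as `rt` and `https`, along with emoji fragments. One possible remedy, which is not part of the original pipeline, is to drop such tokens before building the dictionary. The set `NOISE_TOKENS` and the helper `drop_noise` are assumptions for illustration.

```python
# Hypothetical noise list; extend it after inspecting the corpus.
NOISE_TOKENS = {"rt", "http", "https"}

def drop_noise(tokens):
    """Remove noise tokens and tokens with no letters or digits (e.g. emoji residue)."""
    return [
        tok for tok in tokens
        if tok not in NOISE_TOKENS and any(ch.isalnum() for ch in tok)
    ]

# "\ufe0f" (variation selector) and "\u200d" (zero-width joiner) are emoji residue.
sample = ["rt", "today's", "lesson", "islam", "\ufe0f", "\u200d"]
print(drop_noise(sample))
```

Applied to the whole corpus this would be `tweets = [drop_noise(t) for t in tweets]`, run before `corpora.Dictionary(tweets)`.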

In [24]:
# Find optimal number of topics

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        # Use the dictionary passed in as an argument rather than the global id2word
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
# Keep the returned models and scores; the cells below use both
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=texts, start=5, limit=40, step=6)

In [None]:
# Show graph
# These values must match the arguments passed to compute_coherence_values
limit=40; start=5; step=6
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
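
Rather than reading the best model off the printout, the index of the highest coherence score can be computed. `best_num_topics` is a hypothetical helper, and the scores below are placeholders, not results from this corpus.

```python
def best_num_topics(topic_counts, scores):
    """Return (index, num_topics, score) for the highest coherence score."""
    if len(topic_counts) != len(scores):
        raise ValueError("topic_counts and scores must have the same length")
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx, topic_counts[best_idx], scores[best_idx]

# Grid matching start=5, limit=40, step=6: 5, 11, 17, 23, 29, 35
topic_counts = list(range(5, 40, 6))
# Placeholder scores for illustration only.
placeholder_scores = [0.30, 0.34, 0.39, 0.37, 0.36, 0.35]

idx, k, score = best_num_topics(topic_counts, placeholder_scores)
print(idx, k, score)
```

With real scores, `model_list[idx]` would then select the corresponding model.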

In [None]:
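# Select the model and print the topics
# With start=5 and step=6, model_list holds models for 5, 11, 17, 23, 29 and 35 topics,
# so index 3 is the 23-topic model
optimal_model = model_list[3]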
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

In [28]:
# Find the dominant topic for each tweet

def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=texts):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # Dominant topic = the topic with the highest proportion in this document
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=texts)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

Unnamed: 0,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,11.0,0.0886,"police, killed, breaking, isis, attack, shooti...","[new, convoy, arrived]"
1,1,11.0,0.11,"police, killed, breaking, isis, attack, shooti...","[civilians, killed, mosul]"
2,2,6.0,0.0886,"rt, white, black, people, house, hate, man, am...","[truly, know, know, true, paybacks, hell, pers..."
3,3,9.0,0.0915,"rt, day, love, today, life, remember, twitter,...","[remember, last, summer, dear, ️]"
4,4,13.0,0.1061,"trump, president, donald, rt, trump's, win, su...","[us, gay, arriage, licsence, refused, clerk, k..."
5,5,5.0,0.1714,"rt, video, watch, 's, https, hey, here's, wron...","[former, president, jimmy, carter, begin, radi..."
6,6,16.0,0.1369,"hillary, clinton, fbi, state, report, emails, ...","[new, trove, purported, ashley, madison, data,..."
7,7,3.0,0.1444,"rt, islam, lesson, today's, ️, ️️, religion, p...","[today's, lesson, islam, ️, ️]"
8,8,8.0,0.1886,"rt, woman, man, york, home, http, year, school...","[new, york, prison, escapee, david, sweat, ple..."
9,9,8.0,0.1805,"rt, woman, man, york, home, http, year, school...","[subways, jared, fogle, faces, years, prison, ..."
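
The tweets-per-topic counts behind Figure \ref{fig:num_tweets_topic} follow from the `Dominant_Topic` column. A small sketch using only the ten rows shown above; on the full frame, `df_dominant_topic['Dominant_Topic'].value_counts()` would give the same kind of tally.

```python
from collections import Counter

# Dominant_Topic values copied from the ten rows displayed above.
dominant_topics = [11, 11, 6, 9, 13, 5, 16, 3, 8, 8]

tweets_per_topic = Counter(dominant_topics)
for topic, n in sorted(tweets_per_topic.items()):
    print(f"Topic {topic}: {n} tweet(s)")
```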


In [30]:
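# index=False avoids writing the row index, which would come back as an 'Unnamed: 0' column
df_dominant_topic.to_csv('mallet17_topics_by_tweet.csv', index=False)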
df_dominant_topic = pd.read_csv('mallet17_topics_by_tweet.csv')
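
Note that `to_csv` stores each token list in the `Text` column as its string representation, so after `read_csv` the column holds strings rather than lists. A minimal sketch of recovering a list with `ast.literal_eval`, shown on a single cell value; on the full frame, `df_dominant_topic['Text'].apply(ast.literal_eval)` is one option.

```python
import ast

# What a Text cell looks like after a CSV round trip: a string, not a list.
text_cell = "['new', 'convoy', 'arrived']"

tokens = ast.literal_eval(text_cell)
print(type(tokens).__name__, tokens)
```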