In [16]:
#                               RESEARCH ON EXISTING ARTICLES ABOUT TOPIC MODELING

In [17]:
# Intro

In [18]:
# In this Jupyter Notebook I investigate two modern approaches to Topic Modeling. The purpose is to understand the
# methods that engineers and researchers currently use to deploy Topic Modeling. The motivation for gaining this
# insight is to apply these approaches to a dataset obtained from Kaggle that contains 34 million Reddit comments
# from the subreddit WallStreetBets.

In [19]:
#

In [21]:
# Citations

In [22]:
# The articles referenced in this Jupyter Notebook are fully cited at the bottom of the Notebook. 

In [23]:
#

In [24]:
# First Article

In [25]:
# The first paper to be examined is "Topic Modeling for Social Media Content: A Practical Approach".
# In this paper the authors explore a method of Topic Modeling called Latent Dirichlet Allocation, commonly referred
# to as LDA. It is one of the most widely used Topic Modeling methods in industry.
# What is LDA?
#         " a generative probabilistic model for collections of discrete data, LDA is based on a three-level
#         hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over
#         an underlying set of topics. In turn, Each topic is considered as an infinite mixture over an underlying
#         set of topic probabilities [7]. Topic models such as the LDA have become a ubiquitous and effective tool
#         in machine learning."
# (Vala, Shayaa, & Babanejaddehaki, 2016)
# 
# In plainer terms, what is LDA?
# LDA assumes that hidden topics in the documents influence which words are used. Its goal is to infer those
# topics and how they are distributed across the documents, as well as how words relate to each topic.
# Conceptually, it starts with random assignments of words to topics in the documents. Then it iteratively refines
# these assignments based on the distribution of words and topics across the entire corpus (i.e., all documents used).
# To perform LDA one can use a library such as Gensim or scikit-learn.
#
# The paper uses a dataset of 90,527 records of tweets. 
# They cleaned and preprocessed the dataset as the following excerpt explains:
#
#          Before running LDA algorithm over 30 input datasets, we cleaned the social media data in several steps. 
#          To this end, after removing punctuations, extra spaces, and other unnecessary patterns, we created a list
#          of English and Malay stop words including over 600 common words to get deleted from the input dataset.
# (Vala, Shayaa, & Babanejaddehaki, 2016)
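#
# To make the "random assignment, then iterative refinement" idea above concrete, here is a minimal sketch of a
# collapsed Gibbs sampler for LDA in pure Python. It is an illustration only: the function name toy_lda_gibbs, the
# hyperparameters alpha and beta, and the tiny example corpus are my own assumptions, not taken from the paper.
# Note that Gensim's LdaModel, used later in this notebook, fits LDA with online variational Bayes rather than
# Gibbs sampling, so this sketch shows the concept and not what Gensim does internally.

```python
import random
from collections import Counter


def toy_lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic

    # Step 1: assign every word occurrence to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    # Step 2: repeatedly resample each assignment given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Weight of topic k: how much document d uses k, times how much k uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return doc_topic, topic_word


# Made-up corpus: two "options" comments and two "Tesla" comments.
toy_docs = [
    ["call", "put", "buy", "sell", "call"],
    ["tesla", "elon", "tesla", "car"],
    ["put", "sell", "call", "buy"],
    ["elon", "car", "tesla", "elon"],
]
doc_topic, topic_word = toy_lda_gibbs(toy_docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, (+counts).most_common(3))  # unary + drops words whose count fell to zero
```

# The two returned tables are the raw counts from which the document-topic and topic-word distributions would be
# estimated. On a corpus this small the topics may or may not separate cleanly.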

In [26]:
# code
# (Let's implement what the paper describes, plus dropping NA values. NOTE:
#    I was unable to find the dataset the paper used, so I am using a different
#    dataset. It comes from Reddit, so a profanity warning applies.)

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
def remove_special_characters(text):
    if isinstance(text, str):  # Check if input is a string
        cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
        return cleaned_text
    else:
        return ''

def convert_to_lowercase(text):
    return text.lower()

def remove_extra_whitespace(text):
    cleaned_text = re.sub(r'\s+', ' ', text)
    return cleaned_text

def remove_stopwords(text):
    if isinstance(text, str) and not pd.isnull(text):
        stop_words = set(stopwords.words("english"))
        tokens = nltk.word_tokenize(text)
        filtered_text = ' '.join([word for word in tokens if word.lower() not in stop_words])
        return filtered_text
    else:
        return np.nan

def lemmatize_words(text):
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    lemmatized_text = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
    return lemmatized_text

df = df.drop(df[(df['body'].isna())].index)

df['body'] = df['body'].apply(remove_special_characters)

df['body'] = df['body'].apply(convert_to_lowercase)

df['body'] = df['body'].apply(remove_extra_whitespace)

df['body'] = df['body'].apply(remove_stopwords)

df['body'] = df['body'].apply(lemmatize_words)

words_delete = ["removed", "deleted", " ", "  ", "   "]


df = df.drop(df[(df['body'].isin(words_delete)) | (df['author_fullname'].isna())].index)
df = df.drop(df[df["author_fullname"].isin(words_delete)].index)
df = df.drop(df[(df['created_utc'].isin(words_delete)) | (df['created_utc'].isna())].index)
df = df.drop(df[(df['author'].isin(words_delete)) | (df['author'].isna())].index)
df = df.drop(df[(df['total_awards_received'].isin(words_delete)) | (df['total_awards_received'].isna())].index)

from nltk.tokenize import word_tokenize

def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

df['body'] = df['body'].apply(lambda x: tokenize_text(x))

useless = ["", "yes", "ban", "way", "lol", "f", "nice", "fuck", "link", "ok", "know",
           "rip", "thanks", "gay", "nope", "hope", "guh", "retarded", "stonks, go", "retard", "position, ban",
           "eat, dongus, fuckin, nerd, bot, action, performed, automatically, please, contact, moderator, subreddit, message, compose, r, wallstreetbets, question, concern",
           "buy", "fact", "right",
           "post, flaired, dd, dd, list, find, fresh, wsb, dd, http, n, reddit, com, r, wallstreetbets, search, sort, new, amp, q, flair, 3add, amp, restrict, sr, amp, da, misuse, dd, flair, shitposts, short, vague, guess, unexplained, news, link, etc, please, change, flair, dd, mod, notified, thread, sure, flair, use, check, guide, post, flair, http, www, reddit, com, r, wallstreetbets, wiki, linkflair, bot, action, performed, automatically, please, contact, moderator, subreddit, message, compose, r, wallstreetbets, question, concern",
           "good", "15, monday", "future", "yep", "next, week", "always", "say, take, see, hold, gain, nobody, left", "nah", "bruh",
           "oof", "real", "please, resubmit, shorter, title, bot, action, performed, automatically, please, contact, moderator, subreddit, message, compose, r, wallstreetbets, question, concern",
           "true", "thank", "rh", "bb", "say", "earnings", "go", "like", "broken, spoke, flair, plz, mod", "priced", "tldr", "yessir",
           "maybe", "sir, bread, line, bot, action, performed, automatically, please, contact, moderator, subreddit, message, compose, r, wallstreetbets, question, concern",
           "username, check", "exactly", "lmao", "son, bitch", "b", "lolol, future, barely, green"]

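# After tokenization each body is a list of tokens, which isin() cannot hash. Join the tokens
# into a ", "-separated string so each comment can be compared against the entries in `useless`.
df = df.drop(df[df['body'].str.join(", ").isin(useless)].index)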

final_df = df.drop("author_fullname", axis=1)

In [66]:
# The above code was not executed here. I already used it to clean the dataset in a different notebook, and I load
# that cleaned file below. It is displayed here to show the exact steps taken to clean the data.

In [28]:
# Now that we have a nice, squeaky-clean dataset, let's do some LDA modeling

In [29]:
# Load in the data using environment variables to keep the data secure

In [50]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [51]:
fpathname = os.environ.get("filepathname")

In [52]:
filedata = pd.read_csv(fpathname)

In [57]:
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess


In [94]:
documents = filedata["body"]

# Tokenize the documents using Gensim's simple_preprocess
tokenized_docs = [simple_preprocess(doc) for doc in documents]
# Build the vocabulary and convert each tokenized document to a bag-of-words vector
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train LDA model
num_topics = 5  # Number of topics you want to discover
lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=10)

# Print topics and their most important words
for topic in lda_model.show_topics():
    print(topic)

(0, '0.056*"http" + 0.053*"com" + 0.044*"amp" + 0.040*"www" + 0.039*"reddit" + 0.034*"message" + 0.022*"post" + 0.022*"wallstreetbets" + 0.018*"compose" + 0.018*"vote"')
(1, '0.021*"na" + 0.019*"day" + 0.018*"gon" + 0.018*"bear" + 0.016*"lol" + 0.014*"market" + 0.014*"go" + 0.014*"tesla" + 0.014*"green" + 0.013*"future"')
(2, '0.019*"get" + 0.011*"money" + 0.010*"work" + 0.009*"people" + 0.008*"shit" + 0.008*"like" + 0.008*"gt" + 0.007*"one" + 0.007*"fucking" + 0.007*"fuck"')
(3, '0.014*"market" + 0.013*"like" + 0.011*"think" + 0.010*"people" + 0.009*"would" + 0.009*"stock" + 0.007*"going" + 0.007*"know" + 0.006*"company" + 0.006*"year"')
(4, '0.028*"call" + 0.020*"put" + 0.016*"buy" + 0.012*"got" + 0.011*"day" + 0.010*"get" + 0.010*"sell" + 0.010*"week" + 0.009*"money" + 0.009*"like"')
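
# The Dictionary and doc2bow calls above turn each tokenized comment into a bag-of-words vector: a list of
# (token_id, count) pairs. A minimal pure-Python sketch of that idea follows. The helper names build_vocabulary and
# to_bow are hypothetical stand-ins for illustration. They are not Gensim's implementation, and Gensim may assign
# ids in a different order.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Made-up comments, already tokenized.
sample_docs = [["buy", "call", "buy"], ["sell", "put", "call"]]
vocab = build_vocabulary(sample_docs)
bows = [to_bow(doc, vocab) for doc in sample_docs]
print(vocab)
print(bows)
```

# Word order is discarded in this representation; only which words occur, and how often, is kept.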


In [97]:
# Result 

In [96]:
# The result seems reasonable. The topics the model has created do form some sensible groupings.
# The first topic is about links being shared. The second is about users having positive
# and negative outlooks on the market, particularly Tesla stock. The third seems to sum up users' personal
# philosophy. The fourth seems to be about people debating companies. The final topic seems to correctly group
# financial instruments such as calls and puts with the terms used alongside them, such as buy and sell, and with
# a time period such as day or week.

In [100]:
# Conclusion and Improvements

In [99]:
# The LDA approach has decent results, but the results are not as focused as I want them to be.
# The purpose of gaining insight into industry approaches for Topic Modeling is to apply these techniques to
# the Reddit data and see the general topics that are discussed and the sentiment around stocks. As demonstrated above,
# the topics have not yielded many stocks (the exception is "tesla"; "amp" in Topic 0 appears alongside "http" and
# "www", so it is most likely residue of the HTML entity "&amp;" rather than a ticker).
# To potentially improve this model, one promising approach is called "seed words". A seed word or set of words is
# supplied for each topic, which biases the model toward forming topics around those words.
#
# Other approach 1:
# Topic pruning: take the words from an LDA output topic, iteratively delete them from the original dataset, then
# re-apply LDA to the resultant dataset.
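#
# One way the "seed words" idea is sometimes implemented is by biasing the topic-word prior, so that each seed word
# starts with extra weight in its intended topic. The sketch below only builds such a prior matrix. The helper name
# build_seeded_prior, the base and boost values, and the example seed words are all assumptions for illustration,
# and I have not run this on the Reddit data.

```python
def build_seeded_prior(token2id, seed_words, num_topics, base=0.01, boost=1.0):
    """Return a num_topics x vocab_size prior with extra weight on seed words."""
    prior = [[base] * len(token2id) for _ in range(num_topics)]
    for topic, words in seed_words.items():
        for word in words:
            if word in token2id:  # seed words missing from the vocabulary are skipped
                prior[topic][token2id[word]] = base + boost
    return prior


# Hypothetical vocabulary and seed words ("gme" is deliberately absent from the vocabulary).
token2id = {"tesla": 0, "elon": 1, "call": 2, "put": 3, "market": 4}
seed_words = {0: ["tesla", "elon"], 1: ["call", "put", "gme"]}
prior = build_seeded_prior(token2id, seed_words, num_topics=3)
for row in prior:
    print(row)
```

# With Gensim, a matrix of shape (num_topics, number of terms) like this one can be converted to a NumPy array and
# passed as the eta argument of LdaModel, with token2id taken from dictionary.token2id. Topics without seed words
# (topic 2 here) keep a flat prior and remain free to capture anything else.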

In [109]:
#

In [108]:
#

In [98]:
# Second Article

In [101]:
# The second source to be examined is not a paper but an article. It details the process a research assistant at the
# University of Massachusetts Amherst went through to apply Topic Modeling to his university's social media. The goal
# was to analyze which topics got the most engagement so the social media team can take that data into
# account in the future.
#
# The method for Topic Modeling used in the article's code is truncated SVD (Singular Value Decomposition) applied to
# TF-IDF vectors, a technique also known as Latent Semantic Analysis. The researcher first
# obtains the data and then preprocesses it by cleaning it, removing stop words, and tokenizing the text.
# He then fits the model and views the results.
# A second function improves the model by first restricting the data to posts whose
# engagement falls between the 75th and 100th percentiles.
# The model is then fit again and the results are examined again.
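#
# Before fitting the model, the article's code (reproduced further down) turns each comment into a TF-IDF vector with
# TfidfVectorizer. As a rough sketch of what that weighting does, the pure-Python example below computes smoothed,
# L2-normalized TF-IDF weights for a toy corpus. It assumes the formula idf = ln((1 + n) / (1 + df)) + 1, which is
# what scikit-learn documents for smooth_idf=True. The function name tfidf and the toy posts are made up.

```python
import math
from collections import Counter


def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for doc in tokenized_docs:
        raw = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# Made-up posts, already tokenized.
toy_posts = [["calls", "puts", "calls"], ["calls", "tesla"], ["tesla", "elon"]]
toy_vectors = tfidf(toy_posts)
for vector in toy_vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

# Terms that appear in fewer posts receive a larger idf, so within the last toy post "elon" outweighs "tesla".
# The max_df=0.5 setting in the article's code goes one step further and drops terms that occur in more than
# half of the documents.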

In [107]:
# code (Here I will recreate the steps the researcher took to make this model)

In [61]:
# Note: The data used in the article is private analytics from his university's social media, so I do not have access to it.
#       Instead, I will use a social media comment-section dataset, as it is similar to the data used.
# I will be using an uncleaned version of the Reddit dataset that has had no preprocessing done on it.

In [None]:
#First model

In [58]:
def topic_model(file, colname, topics_num):
    
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    
    
    filedata = pd.read_csv(file)
    
    df = filedata[0:305000].copy()  # copy so the column assignments below do not act on a slice view
    
        # Clean the text
    df = df.dropna(subset=[colname])
    df['clean_title'] = df[colname].str.replace("[^a-zA-Z#]", " ", regex=True)  # regex=True is required in pandas >= 2.0
    df['clean_title'] = df['clean_title'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
    df['clean_title'] = df['clean_title'].apply(lambda x: x.lower())
    
        # deal with the stop word
    stop_words = set(stopwords.words('english'))
    tokenized_doc = df['clean_title'].apply(lambda x: x.split())
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
    
        # merge the tokenized words back into sentences (index-safe after dropna)
    df['clean_title'] = tokenized_doc.apply(' '.join)
    
        # vectorize it
    vectorizer = TfidfVectorizer(max_features= 500, # keep top500 terms 
                                 max_df = 0.5, 
                                 smooth_idf=True)
    X = vectorizer.fit_transform(df['clean_title'])
    
        # SVD represent documents and terms in vectors 
    svd_model = TruncatedSVD(n_components = topics_num, algorithm='randomized', n_iter=100, random_state=122)
    svd_model.fit(X)
    
    terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    for i, comp in enumerate(svd_model.components_):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:6]
        print("Topic "+str(i)+": ")
        print('-------')
        for t in sorted_terms:
            print(t[0])
            print(' ')
            
            
    

In [54]:
secondfpathname = os.environ.get("secondfilepathname")

In [61]:
import warnings
warnings.filterwarnings('ignore')

In [62]:
g = topic_model(secondfpathname, "body", 4)
g

Topic 0: 
-------
removed
 
https
 
reddit
 
message
 
wallstreetbets
 
post
 
Topic 1: 
-------
deleted
 
like
 
post
 
comment
 
calls
 
account
 
Topic 2: 
-------
calls
 
like
 
puts
 
fuck
 
going
 
money
 
Topic 3: 
-------
calls
 
puts
 
bought
 
buying
 
sell
 
holding
 


In [17]:
# Second Function that is described in the article.

In [24]:
def topic_model_quantile(file, colname, metric_col, lower_quantile_no, upper_quantile_no ,topics_num):
    
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    
    df = pd.read_csv(file)
       
    df = df.dropna(subset=[colname])
    
    ##### Subset the data with quantile #####
    lower_quantile, upper_quantile =   df[metric_col].quantile([lower_quantile_no/100, upper_quantile_no/100])
    df = df.loc[(df[metric_col] > lower_quantile) & (df[metric_col] < upper_quantile)]
    
    df.reset_index(inplace=True)

In [35]:
# Sadly, the above function cannot be applied to my dataset with a similar result.
# The code is meant for a dataset whose engagement metric can be meaningfully split
# into quartiles.
# My dataset's metric, awards received, takes only three values: 0, 1, and 2.
# Furthermore, comments with no awards outnumber comments with awards by 25 to 1.
# The closest approximation to the desired outcome is the following code,
# which essentially keeps any comment with more than 0 awards.
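#
# The problem can be seen with a small made-up example. The sketch below builds a hypothetical list of award counts
# with the 25-to-1 ratio described above (the exact counts are invented) and computes the 75th percentile with
# Python's standard library. Because most comments have 0 awards, that percentile is 0, and the strict inequalities
# used in topic_model_quantile() then keep only values strictly between 0 and the maximum.

```python
import statistics

# Hypothetical award counts: 2500 comments with 0 awards, 90 with 1, 10 with 2.
awards = [0] * 2500 + [1] * 90 + [2] * 10

# quantiles(n=4, method="inclusive") returns the 25th, 50th and 75th percentiles,
# interpolating linearly between data points.
lower = statistics.quantiles(awards, n=4, method="inclusive")[2]
upper = max(awards)  # the 100th percentile

# Filter as in topic_model_quantile(): strictly between the two cut-offs.
strict_subset = [a for a in awards if lower < a < upper]
# Filter as in topic_models(): any comment with at least one award.
awarded_subset = [a for a in awards if a >= 1]

print(lower, upper, len(strict_subset), len(awarded_subset))
```

# In this toy data the quantile filter silently drops the comments with the most awards, while the simple
# "at least one award" filter keeps every awarded comment.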

In [64]:
def topic_models(file, colname, topics_num):
    
    import nltk
    import pandas as pd
    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    
    
    filedata = pd.read_csv(file)
    
    df = filedata[filedata.total_awards_received >= 1].reset_index(drop=True)
    df = df[0:305000].copy()  # copy so the column assignments below do not act on a slice view
    
        # Clean the text
    df = df.dropna(subset=[colname])
    df['clean_title'] = df[colname].str.replace("[^a-zA-Z#]", " ", regex=True)  # regex=True is required in pandas >= 2.0
    df['clean_title'] = df['clean_title'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
    df['clean_title'] = df['clean_title'].apply(lambda x: x.lower())
    
        # deal with the stop word
    stop_words = set(stopwords.words('english'))
    tokenized_doc = df['clean_title'].apply(lambda x: x.split())
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
    
        # merge the tokenized words back into sentences (index-safe after dropna)
    df['clean_title'] = tokenized_doc.apply(' '.join)
    
        # vectorize it
    vectorizer = TfidfVectorizer(max_features= 500, # keep top500 terms 
                                 max_df = 0.5, 
                                 smooth_idf=True)
    X = vectorizer.fit_transform(df['clean_title'])
    
        # SVD represent documents and terms in vectors 
    svd_model = TruncatedSVD(n_components = topics_num, algorithm='randomized', n_iter=100, random_state=122)
    svd_model.fit(X)
    
    terms = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    for i, comp in enumerate(svd_model.components_):
        terms_comp = zip(terms, comp)
        sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:6]
        print("Topic "+str(i)+": ")
        print('-------')
        for t in sorted_terms:
            print(t[0])
            print(' ')

In [65]:
y = topic_models(secondfpathname, "body", 4)
y

Topic 0: 
-------
sayter
 
type
 
fuck
 
awards
 
calls
 
elon
 
Topic 1: 
-------
fuck
 
tsla
 
calls
 
like
 
money
 
going
 
Topic 2: 
-------
fuck
 
holy
 
robinhood
 
award
 
gold
 
pull
 
Topic 3: 
-------
deleted
 
fuck
 
post
 
mods
 
shitty
 
want
 


In [106]:
# Result

In [105]:
# (The comments below are on the output from the topic_models() function.)
#
# The code has given decent results.
# There are some patterns in the grouping of topics. For example, the best category seems to be Topic 2. The topic here
# is Robinhood, the stock trading platform. Those familiar with Robinhood and WallStreetBets know it was very popular
# with the subreddit during this time period, before sentiment turned against it. The grouping in Topic 2 has
# "robinhood" and "gold" in it. "gold" likely refers to the paid version of Robinhood called Robinhood Gold, but it
# is possible that it refers to Reddit Gold. Further analysis is required to differentiate.
#
# Topic 0 and Topic 1 seem to overlap. Topic 0 includes "elon" and Topic 1 includes "tsla". There are many possible
# reasons for the overlap, but the most likely is that the subreddit talks a lot about Tesla and Elon, and the
# sheer quantity of comments that mention them spreads them across more than one topic. To add to the confusion,
# Elon is both a meme and a CEO, which are two different topics.
#
# (The comment below is about the difference in performance between topic_model() and topic_models().)
#
# The only change between these two models is the preprocessing step of filtering for comments that gained some level
# of "engagement". The first SVD model (topic_model()) has no such filtering and its topics are not great: they pick up
# a lot of the bad data like "removed" and "https". The second SVD model (topic_models()) was trained only on comments
# that had been filtered to have some "engagement" (i.e., awards), and its topics are more informative.

In [103]:
# Conclusion and Improvement 

In [104]:
# The conclusion I draw from this model and approach is that filtering the data on some metric can be
# a very effective decision when training a model for Topic Modeling. The difference in output was night and day
# when it came to gaining useful topics.
#
# Ways to Improve Performance:
#
# Use a search to find the best threshold for what constitutes "engagement". The researcher's code looks
# at data whose "engagement" falls between the 75th and 100th percentiles, while the code here used
# any comment with more than 0 awards. These conditions gave a boost in performance, but a better-chosen
# threshold could improve the output further. Therefore, searching over various possible values for "engagement"
# could yield better results.
# 

In [102]:
#

In [37]:
# References

#1.
#   Vala, A. R., Shayaa, S., & Babanejaddehaki, G. (2016). Topic Modeling for Social Media Content: A Practical Approach.
#   Department of Data Analytics, Berkshire Media, Kuala Lumpur, Malaysia; Faculty of Computer Science and Information
#   Technology, University Putra Malaysia

#2.
#   Feng, H. (2019). Decide Your Post Topics on Social Media with Simple Topic Modeling.
#   Retrieved from https://henryfeng.medium.com/decide-your-post-topics-on-social-media-with-simple-topic-modeling-8ec1287d0eb9


In [36]:
# Reference links


# This is an academic article about using topic modeling for sentiment analysis on social media.
# https://www.researchgate.net/publication/311755685_Topic_modeling_for_social_media_content_A_practical_approach


# This is an article that details the coding process of a topic model for a student researcher at a university.
# https://henryfeng.medium.com/decide-your-post-topics-on-social-media-with-simple-topic-modeling-8ec1287d0eb9