<a href="https://colab.research.google.com/github/esohman/Waseda-DH/blob/main/REDDIT_TOPIC_MODELING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Topic Modeling of Reddit comments on content policy updates from Reddit Admins** 
This notebook examines how topics shift in the comments responding to posts by Reddit admins about the site's content policies. It takes five posts: two from 2015, one from 2017, one from 2018, and one from 2020. All of them concern what kind of content may be posted on Reddit and which measures will be taken against content that violates these policies.

LDA topic modeling will be used to see how the general discourse among Redditors has changed around limitations on content and, later, limitations on hate speech or speech that incites violence against a particular group.

#**Step 1: Scraping Reddit comment data and turning it into dataframes for analysis**

In [None]:
!pip install praw #install praw for reddit scraping
import praw
from praw.models import MoreComments



In [None]:
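import os
#Reddit API credentials, read from environment variables rather than hard-coded (the variable names are placeholders)
reddit = praw.Reddit(client_id = os.environ['REDDIT_CLIENT_ID'], client_secret = os.environ['REDDIT_CLIENT_SECRET'], user_agent = 'ArisaWebScraping')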

##**'Let's Talk Content. AMA.' in r/announcements, 2015:**

In [None]:
AMA2015 = reddit.submission(url = "https://www.reddit.com/r/announcements/comments/3djjxw/lets_talk_content_ama/") #open url of post

**Excerpt:** 

*One thing that isn't up for debate is why Reddit exists. Reddit is a place to have open and authentic discussions. The reason we’re careful to restrict speech is because people have more open and authentic discussions when they aren't worried about the speech police knocking down their door. When our purpose comes into conflict with a policy, we make sure our purpose wins.*

*As Reddit has grown, we've seen additional examples of how unfettered free speech can make Reddit a less enjoyable place to visit, and can even cause people harm outside of Reddit. Earlier this year, Reddit took a stand and banned non-consensual pornography. This was largely accepted by the community, and the world is a better place as a result (Google and Twitter have followed suit). Part of the reason this went over so well was because there was a very clear line of what was unacceptable.* (u/ in r/announcements)

###**Let's create a dataframe of the comment responses to this post.**

In [None]:
author_listAMA2015 = [] #empty list of usernames
score_listAMA2015 = [] #empty list of upvote counts
comment_listAMA2015 = [] #empty list of comments

In [None]:
AMA2015.comments.replace_more(limit = 0) #drop the "load more comments" placeholders without fetching them
for comment in AMA2015.comments: #iterate over the top-level comments on the post
    comment_listAMA2015.append(comment.body) #add body of comment to comment list
    author_listAMA2015.append(comment.author) #add username of comment to author list
    score_listAMA2015.append(comment.score) #add upvote count of comment to score list

It appears that you are using PRAW in an asynchronous environment.
It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



In [None]:
import pandas as pd
AMA2015_dict = {'author': author_listAMA2015 , 'comment': comment_listAMA2015, 'upvote_count': score_listAMA2015} #turn lists to dict 
AMA15 = pd.DataFrame.from_dict(AMA2015_dict, orient = 'index') #turn dict to dataframe
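AMA15 = AMA15.transpose() #flip so that each row is one comment and each column is one field
AMA15 = AMA15[AMA15.comment != '[deleted]'] #drop deleted comments
AMA15 = AMA15[AMA15.author != '[deleted]'] #meant to drop comments by deleted users; PRAW returns None for those authors, so rows with a missing author remain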

In [None]:
AMA15.head() 

| | author | comment | upvote_count |
|---|---|---|---|
| 0 | | When will something be done about subreddit sq... | 2909 |
| 1 | biggmclargehuge | >-Things that are actually illegal, such as co... | 1215 |
| 3 | throwawaytiffany | Are all DMCA takedowns posted to /r/ChillingEf... | 578 |
| 4 | koproller | Hi,\nFirst of all. Thanks for doing this AMA. ... | 2164 |
| 5 | XIGRIMxREAPERIX | /u/spez\nI am confused on the illegal portion.... | 1095 |
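
The list-building and filtering steps above can be folded into one helper. The following is a minimal sketch, not the notebook's own code: `comment_rows` is a hypothetical name, and the stand-in objects only mimic the `body`, `author`, and `score` attributes that PRAW comments expose. It assumes that a deleted account shows up as `author` being `None`, which would explain why a comparison against the string `'[deleted]'` leaves rows with an empty author in the table.

```python
from types import SimpleNamespace

def comment_rows(comments):
    """Turn comment-like objects into row dicts, skipping deleted ones."""
    rows = []
    for comment in comments:
        if comment.body == '[deleted]':  # comment text was deleted
            continue
        if comment.author is None:  # assumption: deleted accounts appear as None
            continue
        rows.append({'author': str(comment.author),
                     'comment': comment.body,
                     'upvote_count': comment.score})
    return rows

# Stand-in data for illustration only (not real Reddit comments)
sample = [
    SimpleNamespace(body='First comment', author='user_a', score=10),
    SimpleNamespace(body='[deleted]', author='user_b', score=3),
    SimpleNamespace(body='Orphaned comment', author=None, score=7),
]
rows = comment_rows(sample)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` would give the same three columns used in this notebook.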


## **'Content Policy Update' in r/announcements, 2015:**

In [None]:
submission2015 = reddit.submission(url = "https://www.reddit.com/r/announcements/comments/3fx2au/content_policy_update/") #open url of post

**Excerpt:**

*Our policies are not changing dramatically from what we have had in the past. One new concept is Quarantining a community, which entails applying a set of restrictions to a community so its content will only be viewable to those who explicitly opt in. We will Quarantine communities whose content would be considered extremely offensive to the average redditor.*

*Today, in addition to applying Quarantines, we are banning a handful of communities that exist solely to annoy other redditors, prevent us from improving Reddit, and generally make Reddit worse for everyone else. Our most important policy over the last ten years has been to allow just about anything so long as it does not prevent others from enjoying Reddit for what it is: the best place online to have truly authentic conversations.* (u/spez in r/announcements)

###**Let's create a dataframe of the comment responses to this post.**

In [None]:
author_list2015 = [] #empty list of usernames
score_list2015 = [] #empty list of upvote count
comment_list2015 = [] #empty list of comments

In [None]:
submission2015.comments.replace_more(limit = 0) #drop the "load more comments" placeholders without fetching them
for comment in submission2015.comments: #iterate over the top-level comments on the post
    comment_list2015.append(comment.body) #add body of comment to comment list
    author_list2015.append(comment.author) #add username of comment to author list
    score_list2015.append(comment.score) #add upvote count of comment to score list




In [None]:
content2015_dict = {'author': author_list2015 , 'comment': comment_list2015, 'upvote_count': score_list2015} #turn lists to dict
content15 = pd.DataFrame.from_dict(content2015_dict, orient = 'index') #turn dict to dataframe
content15 = content15.transpose()
content15 = content15[content15.comment != '[deleted]'] #drop deleted comments
content15 = content15[content15.author != '[deleted]'] #meant to drop comments by deleted users; their author is None, so those rows remain

In [None]:
content15.head() 

| | author | comment | upvote_count |
|---|---|---|---|
| 0 | Cheech5 | > Today, in addition to applying Quarantines, ... | 3736 |
| 2 | BillW87 | For the sake of transparency I feel like it wo... | 1968 |
| 4 | TheMentalist10 | Will you be sharing information about the comm... | 1753 |
| 5 | dwchief | If a user is subscribed to a Quarantined subre... | 1083 |
| 7 | username3 | "the average redditor" \nYikes\n | 387 |


##**'Update on site-wide rules regarding violent content' in r/modnews, 2017:**

In [None]:
submission2017 = reddit.submission(url = "https://www.reddit.com/r/modnews/comments/78p7bz/update_on_sitewide_rules_regarding_violent_content/") #open url of post

**Excerpt:**

*In particular, we found that the policy regarding “inciting” violence was too vague, and so we have made an effort to adjust it to be more clear and comprehensive. Going forward, we will take action against any content that encourages, glorifies, incites, or calls for violence or physical harm against an individual or a group of people; likewise, we will also take action against content that glorifies or encourages the abuse of animals. This applies to ALL content on Reddit, including memes, CSS/community styling, flair, subreddit names, and usernames.* (u/landoflobsters in r/modnews)

###**Let's create a dataframe of the comment responses to this post.**

In [None]:
author_list2017 = [] #empty list of usernames
score_list2017 = [] #empty list of upvotes
comment_list2017 = [] #empty list of comments

In [None]:
submission2017.comments.replace_more(limit = 0) #drop the "load more comments" placeholders without fetching them
for comment in submission2017.comments: #iterate over the top-level comments on the post
    comment_list2017.append(comment.body) #add body of comment to comment list
    author_list2017.append(comment.author) #add username of comment to author list
    score_list2017.append(comment.score) #add upvote count of comment to score list




In [None]:
content2017_dict = {'author': author_list2017 , 'comment': comment_list2017, 'upvote_count': score_list2017} #turn list to dict 
content17 = pd.DataFrame.from_dict(content2017_dict, orient = 'index') #turn dict to dataframe
content17 = content17.transpose()
content17 = content17[content17.comment != '[deleted]'] #drop deleted comments
content17 = content17[content17.author != '[deleted]'] #meant to drop comments by deleted users; their author is None, so those rows remain

In [None]:
content17.head()

| | author | comment | upvote_count |
|---|---|---|---|
| 0 | Deimorz | Why is this posted in /r/modnews and not /r/an... | 2344 |
| 1 | ridddle | What does glorify mean? Will subs like watchpe... | 1135 |
| 2 | turikk | How will the exact phrase "kill your self" be ... | 307 |
| 3 | Grickit | This cycle is so tiring\n\n1) reddit admins to... | 3655 |
| 4 | ShaneH7646 | Can we get a link to the updated rules? | 195 |


##**'Revamping the Quarantine Function' in r/announcements, 2018:**

In [None]:
submission2018 = reddit.submission(url = "https://www.reddit.com/r/announcements/comments/9jf8nh/revamping_the_quarantine_function/") #open post

**Excerpt:**

*On a platform as open and diverse as Reddit, there will sometimes be communities that, while not prohibited by the Content Policy, average redditors may nevertheless find highly offensive or upsetting. In other cases, communities may be dedicated to promoting hoaxes (yes we used that word) that warrant additional scrutiny, as there are some things that are either verifiable or falsifiable and not seriously up for debate (eg, the Holocaust did happen and the number of people who died is well documented). In these circumstances, Reddit administrators may apply a quarantine.*

*The purpose of quarantining a community is to prevent its content from being accidentally viewed by those who do not knowingly wish to do so, or viewed without appropriate context. We’ve also learned that quarantining a community may have a positive effect on the behavior of its subscribers by publicly signaling that there is a problem. This both forces subscribers to reconsider their behavior and incentivizes moderators to make changes.* (u/landoflobsters in r/announcements)

###**Let's create a dataframe of the comment responses to this post.**

In [None]:
author_list2018 = [] #empty username list
score_list2018 = [] #empty upvote list
comment_list2018 = [] #empty comment list

In [None]:
submission2018.comments.replace_more(limit = 0) #drop the "load more comments" placeholders without fetching them
for comment in submission2018.comments: #iterate over the top-level comments on the post
    comment_list2018.append(comment.body) #add comment body to comment list
    author_list2018.append(comment.author) #add username of comment to author list
    score_list2018.append(comment.score) #add upvote count of comment to score list




In [None]:
content2018_dict = {'author': author_list2018 , 'comment': comment_list2018, 'upvote_count': score_list2018} #turn list to dict
content18 = pd.DataFrame.from_dict(content2018_dict, orient = 'index') #turn dict to dataframe
content18 = content18.transpose()
content18 = content18[content18.comment != '[deleted]'] #drop deleted comments
content18 = content18[content18.author != '[deleted]'] #meant to drop comments by deleted users; their author is None, so those rows remain

In [None]:
content18.head()

| | author | comment | upvote_count |
|---|---|---|---|
| 0 | throwhfhsjsubendaway | Quarantined subs are completely inaccessible o... | 630 |
| 1 | Jackgabbay | Guys can we get every sub quarantined? No ads! | 251 |
| 2 | SicariusXLVII | To avoid going in too deep on this mess, how a... | 189 |
| 3 | 5th_Law_of_Robotics | r/againstmensrights is a blatant hate sub and ... | 145 |
| 4 | MikeOxlong209 | I am seriously livid that I can’t view WPD on ... | 294 |


##**'Content Policy Update' in r/announcements, 2020:**

In [None]:
submission2020 = reddit.submission(url = "https://www.reddit.com/r/announcements/comments/hi3oht/update_to_our_content_policy/") #open post

**Excerpt:**

*From our conversations with mods and outside experts, it’s clear that while we’ve gotten better in some areas—like actioning violations at the community level, scaling enforcement efforts, measurably reducing hateful experiences like harassment year over year—we still have a long way to go to address the gaps in our policies and enforcement to date.*

*These include addressing questions our policies have left unanswered (like whether hate speech is allowed or even protected on Reddit), aspects of our product and mod tools that are still too easy for individual bad actors to abuse (inboxes, chats, modmail), and areas where we can do better to partner with our mods and communities who want to combat the same hateful conduct we do.*

*Ultimately, it’s our responsibility to support our communities by taking stronger action against those who try to weaponize parts of Reddit against other people. In the near term, this support will translate into some of the product work we discussed with mods. But it starts with dealing squarely with the hate we can mitigate today through our policies and enforcement.* (u/spez in r/announcements)

###**Let's create a dataframe of the comment responses to this post.**

In [None]:
author_list2020 = [] #empty username list
score_list2020 = [] #empty upvote count list
comment_list2020 = [] #empty comment list

In [None]:
submission2020.comments.replace_more(limit = 0) #drop the "load more comments" placeholders without fetching them
for comment in submission2020.comments: #iterate over the top-level comments on the post
    comment_list2020.append(comment.body) #add content of comment to comment list
    author_list2020.append(comment.author) #add username of comment to author list
    score_list2020.append(comment.score) #add upvote count of comment to score list




In [None]:
content2020_dict = {'author': author_list2020 , 'comment': comment_list2020, 'upvote_count': score_list2020} #turn list to dict
content20 = pd.DataFrame.from_dict(content2020_dict, orient = 'index') #turn dict to dataframe
content20 = content20.transpose()
content20 = content20[content20.comment != '[deleted]'] #drop deleted comments
content20 = content20[content20.author != '[deleted]'] #meant to drop comments by deleted users; their author is None, so those rows remain

In [None]:
content20.head()

| | author | comment | upvote_count |
|---|---|---|---|
| 0 | | Why are the names of most subs censored | 1673 |
| 1 | jomohoe | Holy shit, I can't believe that initial post a... | 5149 |
| 2 | 1984IndianExmuslim | I ~~have~~ had a tiny sub (less than 3000 subs... | 309 |
| 3 | Great_LD | What, if anything will be done about harassmen... | 3446 |
| 4 | Shegham | YOU HAVEN'T DONE SHIT TO ANY OF THE R/Politics... | 50 |


Now that we have our data all neatly organized, we can start using it for topic modeling to see exactly what topics are coming up in response to Reddit's content policy updates!

#**Step 2: Topic Modeling for Reddit comment responses to content policy updates**

The topic modeling code I use is adapted from the [notebook](https://colab.research.google.com/drive/1LXCKl9CAQvAkIskdOXGwlXwagnXG29t2?usp=sharing) provided by machinelearningplus.com. I'll begin by downloading the packages needed for topic modeling. We will be using the top 5 topics from each model for the later paper.



In [None]:
import nltk; nltk.download('stopwords') #stopwords from nltk to clean tokens later

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
!pip install pyLDAvis #pyLDAvis and gensim install to use for topic modeling
!pip install gensim



###***You might have to restart the runtime and run all cells again at this point!***

In [None]:
import re             #necessary stuff
import numpy as np 
import pandas as pd
from pprint import pprint

import gensim         #gensim for topic modeling
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel 

import spacy        #spacy to process data

import pyLDAvis     #pyLDAvis for topic modeling
import pyLDAvis.gensim_models as gensimvis
import matplotlib.pyplot as plt
%matplotlib inline

import logging      #enable logging for gensim
logging.basicConfig(format = '%(asctime)s : %(levelname)s : %(message)s', level = logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category = DeprecationWarning)



In [None]:
from nltk.corpus import stopwords #create stopwords list
stop_words = stopwords.words('english')
stop_words.extend(['reddit', 'site', 'edit','com', 'link', 'website', 'r/announcements', 'html', 'r/modnews', 'www', 'spez', 'landoflobsters']) 
#add extra irrelevant words potentially in comments
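
One thing worth checking when extending the stop word list is whether an entry can ever match a token. The sketch below is a rough standard-library stand-in for gensim's `simple_preprocess`, assuming its default behavior of lowercasing, splitting on non-alphabetic characters, and keeping tokens of 2 to 15 characters; `rough_tokens` is a hypothetical helper, not part of gensim. Under that assumption, an entry such as 'r/announcements' is split apart before the stop word filter sees it, so the entry as written would not match any token.

```python
import re

def rough_tokens(text, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, alphabetic runs, length filter."""
    words = re.findall(r'(?:(?!\d)\w)+', text.lower())
    return [w for w in words if min_len <= len(w) <= max_len]

tokens = rough_tokens('See r/announcements or www.reddit.com')
print(tokens)
print('r/announcements' in tokens)  # the slash-separated form never appears as a token
```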

Functions to tokenize and remove stopwords from data:

In [None]:
def sent_to_words(sentences): #tokenize the comments
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence).encode('utf-8'), deacc = True))

def remove_stopwords(texts): #remove stopwords from tokens
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

nlp = spacy.load('en_core_web_sm', disable = ['parser', 'ner']) #establish nlp
#print(stop_words)

#**Topic Modeling for 'Let's Talk Content. AMA.' in r/announcements, 2015:**

###**Preparing the data:**

In [None]:
dataAMA2015 = AMA15.comment.values.tolist() #turn the comment values in our data frame into a list
dataAMA2015 = [re.sub(r'\s+', ' ', sent) for sent in dataAMA2015] #collapse newlines and other runs of whitespace into single spaces
dataAMA2015 = [re.sub("'", " ", sent) for sent in dataAMA2015] #replace single quotes with spaces
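
To make the effect of the two substitutions concrete, here is a small sketch on a made-up comment string (the sample text is illustrative, not taken from the dataset).

```python
import re

sample = "Hi,\nFirst of all.   Thanks for doing this,\tit's great"
step1 = re.sub(r'\s+', ' ', sample)  # collapse newlines, tabs, and repeated spaces
step2 = re.sub("'", " ", step1)      # replace single quotes with spaces
print(step2)
```

Because the apostrophe becomes a space, a contraction such as "it's" is split into "it" and "s"; under gensim's default minimum token length of 2, the one-letter remainder is then dropped during tokenization.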

In [None]:
data_wordsAMA2015 = list(sent_to_words(dataAMA2015))
print(data_wordsAMA2015[:1])

[['when', 'will', 'something', 'be', 'done', 'about', 'subreddit', 'squatters', 'the', 'existing', 'system', 'is', 'not', 'working', 'qgyh', 'is', 'able', 'to', 'retain', 'top', 'mod', 'of', 'many', 'defaults', 'and', 'large', 'subreddits', 'just', 'because', 'he', 'posts', 'comment', 'every', 'two', 'months', 'this', 'is', 'harming', 'reddit', 'as', 'community', 'when', 'lower', 'mods', 'are', 'veto', 'and', 'removed', 'by', 'someone', 'who', 'is', 'only', 'mod', 'for', 'the', 'power', 'trip', 'will', 'something', 'be', 'done', 'about', 'this']]


###**Building bigrams and trigrams:**

In [None]:
bigramAMA2015 = gensim.models.Phrases(data_wordsAMA2015, min_count = 5, threshold = 100) 
trigramAMA2015 = gensim.models.Phrases(bigramAMA2015[data_wordsAMA2015], threshold = 100) 
bigram_modAMA2015 = gensim.models.phrases.Phraser(bigramAMA2015)
trigram_modAMA2015 = gensim.models.phrases.Phraser(trigramAMA2015)
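
The `min_count` and `threshold` arguments decide which word pairs are merged into a single token. As a sketch of the idea, assuming gensim's default scoring follows the formula given in its documentation, (bigram_count - min_count) / count_a / count_b * vocabulary_size, the function below computes that score for made-up counts; `phrase_score` and all of the numbers are illustrative only.

```python
def phrase_score(count_a, count_b, bigram_count, vocab_size, min_count=5):
    """Score a candidate bigram; higher means the pair co-occurs more than chance."""
    return (bigram_count - min_count) / count_a / count_b * vocab_size

# Made-up counts: word a seen 10 times, word b 8 times, the pair 7 times
score = phrase_score(count_a=10, count_b=8, bigram_count=7, vocab_size=2000)
print(score)
print(score > 100)  # would this pair pass threshold = 100?
```

A pair is merged only when its score exceeds `threshold`, so raising the threshold yields fewer phrases.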



###**Removing stopwords, making bigrams and trigrams, and lemmatizing:**

In [None]:
def make_bigramsAMA2015(texts):
    return [bigram_modAMA2015[doc] for doc in texts]

def make_trigramsAMA2015(texts):
    return [trigram_modAMA2015[doc] for doc in texts]

def lemmatizationAMA2015(texts, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
data_words_nostopsAMA2015 = remove_stopwords(data_wordsAMA2015)
data_words_bigramsAMA2015 = make_bigramsAMA2015(data_words_nostopsAMA2015)
data_words_trigramsAMA2015 = make_trigramsAMA2015(data_words_bigramsAMA2015)
data_lemmatizedAMA2015 = lemmatizationAMA2015(data_words_trigramsAMA2015, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatizedAMA2015[:1])

[['do', 'subreddit', 'squatter', 'exist', 'system', 'work', 'qgyh', 'able', 'top', 'many', 'default', 'large', 'subreddit', 'post', 'comment', 'month', 'harm', 'community', 'low', 'mod', 'veto', 'remove', 'trip', 'do']]

###**Creating dictionary and corpus:**


In [None]:
id2wordAMA2015 = corpora.Dictionary(data_lemmatizedAMA2015)
textsAMA2015 = data_lemmatizedAMA2015
corpusAMA2015 = [id2wordAMA2015.doc2bow(text) for text in textsAMA2015]

###**Build the LDA Model:**

In [None]:
lda_modelAMA2015 = gensim.models.ldamodel.LdaModel(corpus = corpusAMA2015,
                                           id2word = id2wordAMA2015,
                                           num_topics = 20, 
                                           random_state = 100,
                                           update_every = 1,
                                           chunksize = 30,
                                           passes = 200,
                                           alpha = 'auto',
                                           per_word_topics = True)

##**View LDA Topics for 'Let's Talk Content. AMA.' in r/announcements, 2015:**

In [None]:
pprint(lda_modelAMA2015.print_topics())
doc_ldaAMA2015 = lda_modelAMA2015[corpusAMA2015]

[(0,
  '0.061*"enforce" + 0.031*"exactly" + 0.031*"intend" + 0.026*"offer" + '
  '0.021*"rule" + 0.020*"statement" + 0.020*"still" + 0.017*"take" + '
  '0.016*"make" + 0.011*"remove"'),
 (1,
  '0.011*"agenda" + 0.011*"republican" + 0.011*"protect" + 0.011*"idea" + '
  '0.011*"news" + 0.011*"shadowbanned" + 0.011*"redditblog" + 0.011*"promote" '
  '+ 0.011*"http" + 0.001*"conservative"'),
 (2,
  '0.001*"drug" + 0.001*"diminish" + 0.001*"malicious" + 0.001*"line" + '
  '0.001*"gameplay" + 0.001*"game" + 0.001*"footage" + 0.001*"movie" + '
  '0.001*"blurry" + 0.001*"activity"'),
 (3,
  '0.001*"drug" + 0.001*"diminish" + 0.001*"malicious" + 0.001*"line" + '
  '0.001*"gameplay" + 0.001*"game" + 0.001*"footage" + 0.001*"movie" + '
  '0.001*"blurry" + 0.001*"activity"'),
 (4,
  '0.070*"need" + 0.054*"illegal" + 0.045*"other" + 0.036*"content" + '
  '0.026*"define" + 0.026*"may" + 0.024*"type" + 0.024*"abuse" + 0.024*"much" '
  '+ 0.023*"common"'),
 (5,
  '0.030*"person" + 0.027*"must" + 0.025
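
Each topic in the output above is a string of weight*"word" terms joined by plus signs. If the top words are needed as data, for example to tabulate them for the later paper, one option is to parse that string. The helper below is a hypothetical sketch (`parse_topic` is not a gensim function) applied to the first topic string shown above.

```python
import re

def parse_topic(topic_string):
    """Return (word, weight) pairs from a string like '0.061*"enforce" + ...'."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic0 = ('0.061*"enforce" + 0.031*"exactly" + 0.031*"intend" + 0.026*"offer" + '
          '0.021*"rule" + 0.020*"statement" + 0.020*"still" + 0.017*"take" + '
          '0.016*"make" + 0.011*"remove"')
top_words = parse_topic(topic0)
print(top_words[:3])
```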

###**Get perplexity score and coherence score:**

In [None]:
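#log_perplexity returns the per-word likelihood bound (a log value, hence negative), not the perplexity itself
print('\nPerplexity: ', lda_modelAMA2015.log_perplexity(corpusAMA2015))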
coherence_model_ldaAMA2015 = CoherenceModel(model = lda_modelAMA2015, texts = data_lemmatizedAMA2015, dictionary = id2wordAMA2015, coherence = 'c_v')
coherence_ldaAMA2015 = coherence_model_ldaAMA2015.get_coherence()
print('\nCoherence Score: ', coherence_ldaAMA2015)


Perplexity:  -7.408235868576112

Coherence Score:  0.46026506927198313
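
The value printed as "Perplexity" is the per-word likelihood bound returned by `log_perplexity`, which is why it is negative. A minimal sketch of converting it, assuming the relation perplexity = 2^(-bound) that gensim uses in its own log messages:

```python
bound = -7.408235868576112   # value printed above for the 2015 AMA model
perplexity = 2 ** (-bound)   # convert the log-scale bound to a perplexity estimate
print(round(perplexity, 1))
```

Lower perplexity indicates a better fit to the corpus. Because it is computed here on the training corpus, it is most useful for comparing models built on the same data.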


##**Visualize LDA Topics for 'Let's Talk Content. AMA.' in r/announcements, 2015:**

In [None]:
pyLDAvis.enable_notebook()
ldaAMA2015_prepared = gensimvis.prepare(lda_modelAMA2015, corpusAMA2015, id2wordAMA2015)
ldaAMA2015_prepared



#**Topic Modeling for 'Content Policy Update' in r/announcements, 2015:**

###**Preparing the data:**

In [None]:
data2015 = content15.comment.values.tolist()
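data2015 = [re.sub(r'\s+', ' ', sent) for sent in data2015] #collapse newlines and other runs of whitespace into single spaces
data2015 = [re.sub("'", " ", sent) for sent in data2015] #replace single quotes with spaces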

In [None]:
data_words2015 = list(sent_to_words(data2015))
print(data_words2015[:1])

[['today', 'in', 'addition', 'to', 'applying', 'quarantines', 'we', 'are', 'banning', 'handful', 'of', 'communities', 'that', 'exist', 'solely', 'to', 'annoy', 'other', 'redditors', 'prevent', 'us', 'from', 'improving', 'reddit', 'and', 'generally', 'make', 'reddit', 'worse', 'for', 'everyone', 'else', 'our', 'most', 'important', 'policy', 'over', 'the', 'last', 'ten', 'years', 'has', 'been', 'to', 'allow', 'just', 'about', 'anything', 'so', 'long', 'as', 'it', 'does', 'not', 'prevent', 'others', 'from', 'enjoying', 'reddit', 'for', 'what', 'it', 'is', 'the', 'best', 'place', 'online', 'to', 'have', 'truly', 'authentic', 'conversations', 'which', 'communities', 'have', 'been', 'banned']]


###**Building bigrams and trigrams:**

In [None]:
bigram2015 = gensim.models.Phrases(data_words2015, min_count = 5, threshold = 100) 
trigram2015 = gensim.models.Phrases(bigram2015[data_words2015], threshold = 100) 
bigram_mod2015 = gensim.models.phrases.Phraser(bigram2015)
trigram_mod2015 = gensim.models.phrases.Phraser(trigram2015)



###**Removing stopwords, making bigrams and trigrams, and lemmatizing:**

In [None]:
def make_bigrams2015(texts):
    return [bigram_mod2015[doc] for doc in texts]

def make_trigrams2015(texts):
    return [trigram_mod2015[doc] for doc in texts]

def lemmatization2015(texts, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
#bigrams, stopwords, lemmatize for 2015
data_words_nostops2015 = remove_stopwords(data_words2015)
data_words_bigrams2015 = make_bigrams2015(data_words_nostops2015)
data_words_trigrams2015 = make_trigrams2015(data_words_bigrams2015)
data_lemmatized2015 = lemmatization2015(data_words_trigrams2015, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized2015[:1])

[['today', 'addition', 'community', 'exist_solely', 'annoy', 'redditors_prevent_us', 'improve', 'generally_make', 'bad', 'important', 'policy', 'last', 'year', 'allow', 'long', 'prevent', 'other', 'enjoy', 'good', 'place', 'online', 'truly_authentic_conversation', 'community', 'ban']]


###**Creating dictionary and corpus:**

In [None]:
#creating dictionary, corpus, freq for 2015
id2word2015 = corpora.Dictionary(data_lemmatized2015)
texts2015 = data_lemmatized2015
corpus2015 = [id2word2015.doc2bow(text) for text in texts2015]
print(corpus2015[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1)]]
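
Each pair in the output above is (token id, count): the dictionary assigns every distinct token an integer id, and `doc2bow` counts how often each id occurs in a document. The following standard-library sketch mimics that idea on a toy document; `toy_doc2bow` is a hypothetical stand-in, and the id assignment (new tokens numbered in alphabetical order) is an assumption about how gensim orders them.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Count tokens and map them to integer ids, adding unseen tokens to token2id."""
    counts = Counter(tokens)
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)  # next free id
    return sorted((token2id[t], n) for t, n in counts.items())

token2id = {}
bow = toy_doc2bow(['community', 'ban', 'annoy', 'community'], token2id)
print(token2id)
print(bow)
```

The pair with count 2 plays the same role as `(5, 2)` in the output above, where one lemma occurs twice in the first comment.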


###**Build the LDA Model:**

In [None]:
lda_model2015 = gensim.models.ldamodel.LdaModel(corpus = corpus2015,
                                           id2word = id2word2015,
                                           num_topics = 20, 
                                           random_state = 100,
                                           update_every = 1,
                                           chunksize = 30,
                                           passes = 200,
                                           alpha = 'auto',
                                           per_word_topics = True)

##**View the topics for 'Content Policy Update' in r/announcements, 2015:**

In [None]:
#give us the topics
pprint(lda_model2015.print_topics())
doc_lda2015 = lda_model2015[corpus2015]

[(0,
  '0.076*"decide" + 0.073*"use" + 0.037*"threaten" + 0.037*"vague" + '
  '0.028*"encourage" + 0.021*"bully" + 0.021*"constitute" + 0.016*"around" + '
  '0.015*"action" + 0.014*"realize"'),
 (1,
  '0.114*"ban" + 0.061*"sub" + 0.058*"community" + 0.051*"subreddit" + '
  '0.039*"bad" + 0.035*"people" + 0.033*"annoy" + 0.030*"rule" + '
  '0.022*"exist_solely" + 0.022*"know"'),
 (2,
  '0.036*"brigade" + 0.030*"harass" + 0.025*"time" + 0.022*"child" + '
  '0.021*"little" + 0.017*"speech" + 0.017*"free" + 0.016*"leave" + '
  '0.016*"redditor" + 0.015*"update"'),
 (3,
  '0.033*"understand" + 0.022*"problem" + 0.022*"maybe" + 0.020*"think" + '
  '0.019*"solely" + 0.018*"harmful" + 0.017*"simple" + 0.017*"respect" + '
  '0.016*"behavior" + 0.014*"drawing"'),
 (4,
  '0.034*"break" + 0.024*"soon" + 0.022*"great" + 0.021*"correct" + '
  '0.020*"continue" + 0.017*"shitredditsay" + 0.017*"hate" + 0.015*"dark" + '
  '0.014*"thousand" + 0.014*"full"'),
 (5,
  '0.023*"story" + 0.022*"delete" + 0.02

###**Get perplexity score and coherence score:**

In [None]:
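#log_perplexity returns the per-word likelihood bound (a log value, hence negative), not the perplexity itself
print('\nPerplexity: ', lda_model2015.log_perplexity(corpus2015))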
coherence_model_lda2015 = CoherenceModel(model = lda_model2015, texts = data_lemmatized2015, dictionary = id2word2015, coherence = 'c_v')
coherence_lda2015 = coherence_model_lda2015.get_coherence()
print('\nCoherence Score: ', coherence_lda2015)


Perplexity:  -7.276842279325897

Coherence Score:  0.40848219445192446


##**Visualize the topics for 'Content Policy Update' in r/announcements, 2015:**

In [None]:
pyLDAvis.enable_notebook()
lda2015_prepared = gensimvis.prepare(lda_model2015, corpus2015, id2word2015)
lda2015_prepared



#**Topic Modeling for 'Update on site-wide rules regarding violent content' in r/modnews, 2017:**

###**Preparing the data:**

In [None]:
data2017 = content17.comment.values.tolist()
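data2017 = [re.sub(r'\s+', ' ', sent) for sent in data2017] #collapse newlines and other runs of whitespace into single spaces
data2017 = [re.sub("'", " ", sent) for sent in data2017] #replace single quotes with spaces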

In [None]:
data_words2017 = list(sent_to_words(data2017))

###**Building bigrams and trigrams:**

In [None]:
bigram2017 = gensim.models.Phrases(data_words2017, min_count = 5, threshold = 100) 
trigram2017 = gensim.models.Phrases(bigram2017[data_words2017], threshold = 100) 
bigram_mod2017 = gensim.models.phrases.Phraser(bigram2017)
trigram_mod2017 = gensim.models.phrases.Phraser(trigram2017)



###**Removing stopwords, making bigrams and trigrams, and lemmatizing:**

In [None]:
def make_bigrams2017(texts):
    return [bigram_mod2017[doc] for doc in texts]

def make_trigrams2017(texts):
    return [trigram_mod2017[doc] for doc in texts]

def lemmatization2017(texts, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
data_words_nostops2017 = remove_stopwords(data_words2017)
data_words_bigrams2017 = make_bigrams2017(data_words_nostops2017)
data_words_trigrams2017 = make_trigrams2017(data_words_bigrams2017)
data_lemmatized2017 = lemmatization2017(data_words_trigrams2017, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized2017[:1])

[['post', 'announcement', 'user', 'inform', 'wide', 'rule', 'change', 'moderator']]


###**Creating dictionary and corpus:**

In [None]:
id2word2017 = corpora.Dictionary(data_lemmatized2017)
texts2017 = data_lemmatized2017
corpus2017 = [id2word2017.doc2bow(text) for text in texts2017]

###**Build the LDA Model:**

In [None]:
lda_model2017 = gensim.models.ldamodel.LdaModel(corpus = corpus2017,
                                           id2word = id2word2017,
                                           num_topics = 20, 
                                           random_state = 100,
                                           update_every = 1,
                                           chunksize = 30,
                                           passes = 200,
                                           alpha = 'auto',
                                           per_word_topics = True)

##**View the topics for 'Update on site-wide rules regarding violent content' in r/modnews, 2017:**

In [None]:
pprint(lda_model2017.print_topics())
doc_lda2017 = lda_model2017[corpus2017]

[(0,
  '0.048*"would" + 0.045*"prohibit" + 0.045*"view" + 0.034*"people" + '
  '0.033*"mention" + 0.031*"rule" + 0.031*"violence" + 0.031*"require" + '
  '0.031*"change" + 0.016*"hold"'),
 (1,
  '0.123*"remove" + 0.002*"uncensorednew" + 0.002*"bad" + 0.002*"block" + '
  '0.002*"chopping" + 0.002*"person" + 0.002*"sigh" + 0.002*"context" + '
  '0.002*"also" + 0.002*"represent"'),
 (2,
  '0.049*"number" + 0.037*"phone" + 0.025*"know" + 0.025*"fuck" + '
  '0.025*"source" + 0.013*"bit" + 0.013*"law" + 0.013*"bcnd" + 0.013*"faith" + '
  '0.013*"behavior"'),
 (3,
  '0.044*"kill" + 0.041*"call" + 0.031*"die" + 0.029*"finally" + '
  '0.029*"remove" + 0.020*"clarification" + 0.020*"really" + 0.020*"literally" '
  '+ 0.020*"dad" + 0.020*"specifically"'),
 (4,
  '0.078*"many" + 0.078*"clarify" + 0.002*"political" + 0.002*"uncensorednew" '
  '+ 0.002*"chopping" + 0.002*"person" + 0.002*"bad" + 0.002*"upset" + '
  '0.002*"else" + 0.002*"encouragement"'),
 (5,
  '0.117*"mean" + 0.070*"action" + 0.04
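
The topic listing above packs each topic into a single string of `weight*"word"` terms, which is awkward to compare across years. A minimal sketch of splitting such a string into `(word, weight)` pairs follows; `parse_topic` is a hypothetical helper, and the regular expression assumes the string format shown in the output above.

```python
import re

# One term looks like 0.048*"would"; capture the weight and the word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.048*"would" + 0.045*"prohibit"' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

example = '0.048*"would" + 0.045*"prohibit" + 0.045*"view"'
terms = parse_topic(example)
print(terms)
```

gensim's `show_topic` method should return similar pairs directly from the model, so the parser is mainly useful when only the printed output is available.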

###**Get perplexity score and coherence score:**

In [None]:
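# log_perplexity returns a per-word likelihood bound (hence the negative value), not the perplexity itself
print('\nPerplexity: ', lda_model2017.log_perplexity(corpus2017))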
coherence_model_lda2017 = CoherenceModel(model = lda_model2017, texts = data_lemmatized2017, dictionary = id2word2017, coherence = 'c_v')
coherence_lda2017 = coherence_model_lda2017.get_coherence()
print('\nCoherence Score: ', coherence_lda2017)


Perplexity:  -6.49595643267927

Coherence Score:  0.4870299886552071
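
The value printed as "Perplexity" is negative because `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. gensim's own log messages report perplexity as 2 raised to the negated bound, so a rough conversion can be sketched as below. `bound_to_perplexity` is a hypothetical helper, and the base-2 convention is an assumption taken from that logging behaviour.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (assumed base 2) to perplexity."""
    return 2 ** (-per_word_bound)

bound_2017 = -6.49595643267927  # value printed for the 2017 model
perplexity_2017 = bound_to_perplexity(bound_2017)
print(round(perplexity_2017, 2))
```

Lower perplexity is better, which corresponds to a bound closer to zero. Because the bound here is computed on the training corpus, it says little about how the model generalizes.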


##**Visualize the topics for 'Update on site-wide rules regarding violent content' in r/modnews, 2017:**

In [None]:
pyLDAvis.enable_notebook()
lda2017_prepared = gensimvis.prepare(lda_model2017, corpus2017, id2word2017, mds='mmds')
lda2017_prepared



#**Topic Modeling for 'Revamping the Quarantine Function' in r/announcements, 2018:**

###**Preparing the data:**

In [None]:
#topic modeling for 2018
data2018 = content18.comment.values.tolist()
data2018 = [re.sub(r'\s+', ' ', sent) for sent in data2018]  # collapse runs of whitespace into one space
data2018 = [re.sub(r"'", " ", sent) for sent in data2018]  # replace apostrophes with spaces
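
As a quick check of what the two substitutions do to a single comment, here is a small standalone sketch. `clean_comment` and the sample string are hypothetical and only mirror the two `re.sub` calls in the cell above.

```python
import re

def clean_comment(text):
    """Collapse whitespace runs, then replace apostrophes with spaces."""
    text = re.sub(r"\s+", " ", text)
    return re.sub(r"'", " ", text)

# Made-up sample text, not an actual comment from the dataset.
sample = "Quarantined subs\n\naren't   visible on mobile"
cleaned = clean_comment(sample)
print(cleaned)
```

Replacing apostrophes with a space splits contractions into fragments such as `aren` and `t`, so the later tokenizing and stopword steps need to deal with those pieces.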

In [None]:
data_words2018 = list(sent_to_words(data2018))

###**Building bigrams and trigrams:**

In [None]:
#bigram models for 2018
bigram2018 = gensim.models.Phrases(data_words2018, min_count = 5, threshold = 100) 
trigram2018 = gensim.models.Phrases(bigram2018[data_words2018], threshold = 100) 
bigram_mod2018 = gensim.models.phrases.Phraser(bigram2018)
trigram_mod2018 = gensim.models.phrases.Phraser(trigram2018)
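
The `min_count` and `threshold` arguments decide which adjacent word pairs get merged into a single token. The sketch below shows the Mikolov-style score that, as far as I understand it, gensim's default scorer uses. `phrase_score` is a hypothetical function and the counts are invented for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Collocation score: higher means the pair co-occurs more than expected."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts, for illustration only (not taken from the Reddit data).
score = phrase_score(count_a=20, count_b=15, count_ab=12, vocab_size=5000)
is_phrase = score > 100  # threshold = 100, as in the cell above
print(score, is_phrase)
```

A pair seen `min_count` times or fewer scores zero or below, so on a small comment thread a threshold of 100 will likely merge only a handful of pairs.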



###**Removing stopwords, making bigrams and trigrams, and lemmatizing:**

In [None]:
def make_bigrams2018(texts):
    return [bigram_mod2018[doc] for doc in texts]

def make_trigrams2018(texts):
    return [trigram_mod2018[doc] for doc in texts]

def lemmatization2018(texts, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
data_words_nostops2018 = remove_stopwords(data_words2018)
data_words_bigrams2018 = make_bigrams2018(data_words_nostops2018)
data_words_trigrams2018 = make_trigrams2018(data_words_bigrams2018)
data_lemmatized2018 = lemmatization2018(data_words_trigrams2018, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized2018[:1])

[['quarantine', 'sub', 'completely', 'inaccessible', 'mobile', 'app']]


###**Creating dictionary and corpus:**

In [None]:
id2word2018 = corpora.Dictionary(data_lemmatized2018)
texts2018 = data_lemmatized2018
corpus2018 = [id2word2018.doc2bow(text) for text in texts2018]

###**Build the LDA Model:**

In [None]:
lda_model2018 = gensim.models.ldamodel.LdaModel(corpus = corpus2018,
                                           id2word = id2word2018,
                                           num_topics = 20, 
                                           random_state = 100,
                                           update_every = 1,
                                           chunksize = 30,
                                           passes = 200,
                                           alpha = 'auto',
                                           per_word_topics = True)

##**View the topics for 'Revamping the Quarantine Function' in r/announcements, 2018:**

In [None]:
pprint(lda_model2018.print_topics())
doc_lda2018 = lda_model2018[corpus2018]

[(0,
  '0.089*"read" + 0.029*"measure" + 0.023*"subject" + 0.019*"grow" + '
  '0.017*"email" + 0.016*"break" + 0.009*"self" + 0.009*"sunshine" + '
  '0.009*"plain" + 0.009*"reasonable"'),
 (1,
  '0.053*"become" + 0.042*"subreddit" + 0.035*"guy" + 0.030*"prevent" + '
  '0.030*"content" + 0.025*"shit" + 0.024*"decision" + 0.023*"ad" + '
  '0.021*"appropriate" + 0.021*"context"'),
 (2,
  '0.050*"step" + 0.036*"place" + 0.023*"congratulation" + 0.016*"purely" + '
  '0.016*"religion" + 0.016*"politic" + 0.015*"disgust" + 0.015*"million" + '
  '0.015*"parent" + 0.015*"reddit"'),
 (3,
  '0.082*"political" + 0.066*"internet" + 0.060*"good" + 0.046*"stupid" + '
  '0.034*"far" + 0.024*"much" + 0.023*"think" + 0.020*"obvious" + 0.020*"line" '
  '+ 0.018*"really"'),
 (4,
  '0.020*"freedom" + 0.017*"let" + 0.015*"believe" + 0.014*"idea" + '
  '0.014*"violence" + 0.012*"propaganda" + 0.011*"do" + 0.010*"company" + '
  '0.010*"tool" + 0.009*"amount"'),
 (5,
  '0.064*"many" + 0.054*"stop" + 0.040*"pag

###**Get perplexity and coherence score:**

In [None]:
print('\nPerplexity: ', lda_model2018.log_perplexity(corpus2018))
coherence_model_lda2018 = CoherenceModel(model = lda_model2018, texts = data_lemmatized2018, dictionary = id2word2018, coherence = 'c_v')
coherence_lda2018 = coherence_model_lda2018.get_coherence()
print('\nCoherence Score: ', coherence_lda2018)


Perplexity:  -7.405006158622198

Coherence Score:  0.4271588835599879


##**Visualize the topics for 'Revamping the Quarantine Function' in r/announcements, 2018:**

In [None]:
pyLDAvis.enable_notebook()
lda2018_prepared = gensimvis.prepare(lda_model2018, corpus2018, id2word2018)
lda2018_prepared



#**Topic Modeling for 'Content Policy Update' in r/announcements, 2020:**

###**Preparing the data:**

In [None]:
#topic modeling for 2020
data2020 = content20.comment.values.tolist()
data2020 = [re.sub(r'\s+', ' ', sent) for sent in data2020]  # collapse runs of whitespace into one space
data2020 = [re.sub(r"'", " ", sent) for sent in data2020]  # replace apostrophes with spaces

In [None]:
data_words2020 = list(sent_to_words(data2020))

###**Building bigrams and trigrams:**

In [None]:
bigram2020 = gensim.models.Phrases(data_words2020, min_count = 5, threshold = 100) 
trigram2020 = gensim.models.Phrases(bigram2020[data_words2020], threshold = 100) 
bigram_mod2020 = gensim.models.phrases.Phraser(bigram2020)
trigram_mod2020 = gensim.models.phrases.Phraser(trigram2020)



###**Removing stopwords, making bigrams and trigrams, and lemmatizing:**

In [None]:
def make_bigrams2020(texts):
    return [bigram_mod2020[doc] for doc in texts]

def make_trigrams2020(texts):
    return [trigram_mod2020[doc] for doc in texts]

def lemmatization2020(texts, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
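
The per-year helpers differ only in which phrase model they close over, so one alternative is a single function that takes the model as an argument. The sketch below is one possible shape, not the notebook's approach. `apply_phrases` is hypothetical, and `JoinPair` is a toy stand-in for a trained `Phraser` so the example runs without gensim.

```python
def apply_phrases(texts, phrase_model):
    """Apply any phrase model that supports model[doc] to each document."""
    return [phrase_model[doc] for doc in texts]

class JoinPair:
    """Hypothetical stand-in for a trained Phraser: joins one fixed word pair."""
    def __init__(self, first, second):
        self.first, self.second = first, second

    def __getitem__(self, doc):
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) == (self.first, self.second):
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2  # skip both words of the merged pair
            else:
                out.append(doc[i])
                i += 1
        return out

docs = [["hate", "speech", "policy"], ["policy", "update"]]
joined = apply_phrases(docs, JoinPair("hate", "speech"))
print(joined)
```

Passing the model explicitly keeps each year's pipeline to one call per step and reduces the chance of calling a helper bound to another year's model.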

In [None]:
data_words_nostops2020 = remove_stopwords(data_words2020)
data_words_bigrams2020 = make_bigrams2020(data_words_nostops2020)
data_words_trigrams2020 = make_trigrams2020(data_words_bigrams2020)
data_lemmatized2020 = lemmatization2020(data_words_trigrams2020, allowed_postags = ['NOUN', 'ADJ', 'VERB', 'ADV'])

###**Creating dictionary and corpus:**

In [None]:
id2word2020 = corpora.Dictionary(data_lemmatized2020)
texts2020 = data_lemmatized2020
corpus2020 = [id2word2020.doc2bow(text) for text in texts2020]

###**Build the LDA Model:**

In [None]:
lda_model2020 = gensim.models.ldamodel.LdaModel(corpus = corpus2020,
                                           id2word = id2word2020,
                                           num_topics = 20, 
                                           random_state = 100,
                                           update_every = 1,
                                           chunksize = 40,
                                           passes = 200,
                                           alpha = 'auto',
                                           per_word_topics = True)

##**View the topics for 'Content Policy Update' in r/announcements, 2020:**

In [None]:
pprint(lda_model2020.print_topics())
doc_lda2020 = lda_model2020[corpus2020]

[(0,
  '0.030*"post" + 0.026*"hatespeech" + 0.017*"massive" + 0.017*"trans" + '
  '0.017*"hundred" + 0.009*"illness" + 0.009*"locker" + 0.009*"low" + '
  '0.009*"ridiculing" + 0.009*"ratio"'),
 (1,
  '0.087*"group" + 0.073*"hate" + 0.058*"people" + 0.057*"rule" + '
  '0.056*"protect" + 0.045*"majority" + 0.031*"identity" + 0.026*"base" + '
  '0.025*"promote" + 0.025*"attack"'),
 (2,
  '0.101*"black" + 0.037*"person" + 0.016*"watch" + 0.016*"solely" + '
  '0.015*"merit" + 0.009*"definition" + 0.008*"treatment" + 0.008*"thumb" + '
  '0.008*"twitter" + 0.008*"similar"'),
 (3,
  '0.031*"can" + 0.023*"run" + 0.020*"club" + 0.018*"abuse" + '
  '0.018*"harassment" + 0.018*"moderator" + 0.013*"ideology" + 0.012*"fly" + '
  '0.011*"power" + 0.010*"ban"'),
 (4,
  '0.074*"private" + 0.058*"still" + 0.047*"process" + 0.028*"see" + '
  '0.020*"check" + 0.019*"die" + 0.014*"point" + 0.011*"genocidal" + '
  '0.011*"running" + 0.010*"well"'),
 (5,
  '0.030*"go" + 0.023*"policy" + 0.021*"comment" + 0.0

###**Get perplexity and coherence scores:**

In [None]:
print('\nPerplexity: ', lda_model2020.log_perplexity(corpus2020))
coherence_model_lda2020 = CoherenceModel(model = lda_model2020, texts = data_lemmatized2020, dictionary = id2word2020, coherence = 'c_v')
coherence_lda2020 = coherence_model_lda2020.get_coherence()
print('\nCoherence Score: ', coherence_lda2020)


Perplexity:  -7.026395856295515

Coherence Score:  0.43356785721148333


##**Visualize the topics for 'Content Policy Update' in r/announcements, 2020:**

In [None]:
pyLDAvis.enable_notebook()
lda2020_prepared = gensimvis.prepare(lda_model2020, corpus2020, id2word2020, mds = 'mmds')
lda2020_prepared

