<a href="https://colab.research.google.com/github/evdelph/MongoDB/blob/main/RedditVisualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [200]:
import pandas as pd
import numpy as np
import json
import nltk
import gensim
nltk.download('stopwords')
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('english')) 
stop_words2 = set([word.strip() for word in open('stopwords.txt')])
import plotly.express as px

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Part 1: Import data

Load the data, copied over via scp, into pandas DataFrames.

In [173]:
posts = pd.read_csv('posts.csv')
posts.head(3)

Unnamed: 0,body,score
0,Just wanted to update with my personal experie...,23
1,"I’m currently on Emgality monthly, with topira...",24
2,Anyone have experience with a CGRP that was wo...,16


In [174]:
replies = pd.read_csv('replies.csv')
replies.head(3)

Unnamed: 0,replies,score
0,"[{""comment_1"":""That’s great. I hope it continu...",23
1,"[{""comment_1"":""As a nurse, Im sure this is inf...",24
2,"[{""comment_1"":""Yes. This is what happened to m...",16


### Part 2: Cleanse data
Convert the 'replies' dataframe into a more usable format. Strip special characters and extra whitespace from both datasets, and convert the text to lowercase.

In [175]:
# function to cleanse data
def cleanse_text(df,col):
  """
  Replace special characters with spaces and lowercase the text
  Remove stopwords
  Return a flat list of the remaining tokens (modifies df[col] in place)
  """
  spec_chars = ["!",'"',"#","%","&","'","(",")",
              "*","+",",","-",".","/",":",";","<",
              "=",">","?","@","[","\\","]","^","_",
              "`","{","|","}","~","–","\xa0","\xa0å","”","“"]

  # regex=False treats characters such as "." and "(" literally
  for char in spec_chars:
      df[col] = df[col].str.replace(char, ' ', regex=False)
  df[col] = df[col].str.lower().str.strip()

  doc = []

  for sentence in df[col]:
    # split() with no argument collapses repeated whitespace,
    # so no empty tokens end up in the output
    for word in sentence.split():
      if word not in stop_words and word not in stop_words2:
        doc.append(word)

  return doc
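
The same cleansing steps can be tried on a single string without pandas. The cell below is a minimal sketch, not the notebook's method: `demo_stop_words` is a tiny hypothetical stand-in for the NLTK list and `stopwords.txt`, and `string.punctuation` is close to, but not identical to, `spec_chars`.

```python
import string

# Hypothetical stand-in for stop_words / stop_words2 used in the notebook.
demo_stop_words = {"i", "to", "with", "my", "just", "the"}

# Map ASCII punctuation (plus a few typographic characters) to a space.
punct = string.punctuation + "–”“\xa0"
to_space = str.maketrans({ch: " " for ch in punct})

def cleanse_string(text, stop_words=demo_stop_words):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    tokens = text.lower().translate(to_space).split()
    return [tok for tok in tokens if tok not in stop_words]

print(cleanse_string("Just wanted to update with my personal experience!"))
```

Splitting with `split()` after translating avoids the empty tokens that repeated spaces would otherwise produce.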

In [176]:
replies_cleansed = pd.DataFrame(columns=['Comments'])

comment_list = []
for reply in replies.replies:
  comments = json.loads(reply)
  for comment in comments:
    comment_list.append(" ".join(list(comment.values())))

replies_cleansed['Comments'] = comment_list
replies_cleansed.head(3)

Unnamed: 0,Comments
0,That’s great. I hope it continues to work so w...
1,"As a nurse, Im sure this is info youre well aw..."
2,Yes. This is what happened to me. Each of the ...
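
Each row of `replies.replies` holds a JSON-encoded list of comment objects, which the loop above flattens into one string per comment. A small sketch of that step on a made-up row follows; only the `comment_1` key is visible in the preview, so the second key is an assumption.

```python
import json

# Hypothetical row shaped like the preview of replies.replies;
# keys other than "comment_1" are an assumption.
raw = '[{"comment_1": "That is great."}, {"comment_2": "Yes. Same here."}]'

def flatten_replies(raw_json):
    """Return one string per comment object in a JSON-encoded reply list."""
    comments = json.loads(raw_json)
    return [" ".join(str(value) for value in comment.values()) for comment in comments]

print(flatten_replies(raw))
```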


### Part 3: Compute Unigram Weights
This step scores unigrams with gensim's TF-IDF model so that the most prominent words from 1000 cgrpMigraine subreddit posts and comments can be analyzed and visualized.

In [177]:
# feed dataframes through text cleansing functions
post_doc = cleanse_text(posts,'body')
replies_doc = cleanse_text(replies_cleansed,'Comments')

# build a gensim dictionary and bag-of-words corpus (one document per dataset)
docs = [post_doc,replies_doc]
dictionary = gensim.corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# compute TF-IDF weights (stored below in a column named 'Frequency')
tfidf = gensim.models.TfidfModel(corpus)
post_bow = pd.Series({dictionary[token_id]:np.around(weight, decimals=2) for token_id, weight in tfidf[corpus[0]]}).to_frame()
replies_bow = pd.Series({dictionary[token_id]:np.around(weight, decimals=2) for token_id, weight in tfidf[corpus[1]]}).to_frame()

post_bow.columns = ['Frequency']
replies_bow.columns = ['Frequency']
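
It may help to see what these weights mean when there are only two documents. The sketch below reimplements TF-IDF by hand, assuming gensim's default weighting (raw count times log2(N / document frequency), followed by L2 normalization); it is an illustration, not the gensim code. Under that assumption, a word that appears in both posts and replies gets an inverse document frequency of log2(2/2) = 0, so the 'Frequency' column effectively ranks words that occur in only one of the two collections.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """TF-IDF per document: raw count * log2(N / doc_freq), L2-normalized.

    Assumed to mirror gensim's TfidfModel defaults; not the gensim code.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        raw = {w: c * math.log2(n_docs / doc_freq[w]) for w, c in Counter(doc).items()}
        raw = {w: v for w, v in raw.items() if v > 0}  # drop zero-weight words
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weighted.append({w: round(v / norm, 2) for w, v in raw.items()})
    return weighted

# Toy stand-ins for post_doc and replies_doc.
demo_posts = ["aimovig", "lupus", "lupus", "migraine"]
demo_replies = ["aimovig", "belly", "migraine", "thigh"]
post_w, reply_w = tfidf_weights([demo_posts, demo_replies])
print(post_w)   # shared words such as "aimovig" are absent
print(reply_w)
```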

In [196]:
# store the 50 highest-weighted words from posts
post_top_50 = post_bow.sort_values(by='Frequency',ascending=False).iloc[0:50,:]
post_top_50.head(3)

Unnamed: 0,Frequency
lupus,0.23
miss,0.17
topiramate,0.17


In [197]:
# store the 50 highest-weighted words from replies
replies_top_50 = replies_bow.sort_values(by='Frequency',ascending=False).iloc[0:50,:]
replies_top_50.head(3)

Unnamed: 0,Frequency
belly,0.28
thigh,0.24
thank,0.24


### Part 4: Perform NLTK functions on bigrams
This will be used to analyze and visualize bigrams (pairs of adjacent words) from 1000 cgrpMigraine subreddit posts and comments.

In [182]:
# compute bigrams on the tokens produced in Part 3 (avoids cleansing the text twice)
post_bigrams = list(nltk.bigrams(post_doc))
replies_bigrams = list(nltk.bigrams(replies_doc))

In [183]:
post_bigrams[0:5]

[('wanted', 'update'),
 ('update', 'personal'),
 ('personal', 'experience'),
 ('experience', 'far'),
 ('far', 'i’d')]

In [184]:
replies_bigrams[0:5]

[('that’s', 'great'),
 ('great', 'hope'),
 ('hope', 'continues'),
 ('continues', 'work'),
 ('work', 'well')]

In [192]:
# for every word, compute frequency distribution of bigrams
post_freq_dist = nltk.ConditionalFreqDist(post_bigrams)
replies_freq_dist = nltk.ConditionalFreqDist(replies_bigrams)
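
For intuition, a conditional frequency distribution over bigrams is a mapping from each word to a counter of the words that immediately follow it. The sketch below builds the same kind of structure with the standard library on a made-up token list; it is an illustration rather than NLTK's implementation. Because the tokens form one flat list, some pairs join the last word of one post to the first word of the next.

```python
from collections import Counter, defaultdict

def conditional_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        followers[first][second] += 1
    return followers

# Made-up tokens, not taken from the dataset.
demo_tokens = ["hope", "emgality", "works", "hope", "emgality", "hope", "aimovig"]
demo_dist = conditional_counts(demo_tokens)
print(dict(demo_dist["hope"]))
```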

In [198]:
post_freq_dist

ConditionalFreqDist(nltk.probability.FreqDist,
                    {'09': FreqDist({'28eli': 1}),
                     '1': FreqDist({'12': 1,
                               '678': 1,
                               '772': 1,
                               'aimovig': 1,
                               'lillyrxteva': 1}),
                     '10': FreqDist({'days': 3}),
                     '100mg': FreqDist({'abortive': 1}),
                     '11': FreqDist({'days': 1, 'months': 1}),
                     '12': FreqDist({'years': 1}),
                     '125mg': FreqDist({'daily': 1}),
                     '140': FreqDist({'might': 1}),
                     '140mg': FreqDist({'discontinuing': 1, 'fingers': 1}),
                     '15': FreqDist({'35': 1}),
                     '16': FreqDist({'previous': 1}),
                     '1605': FreqDist({'option': 1}),
                     '2': FreqDist({'3': 2, '5': 1, 'weeks': 1}),
                     '20': FreqDist({'30': 1, 'day': 1

In [199]:
replies_freq_dist

ConditionalFreqDist(nltk.probability.FreqDist,
                    {'that’s': FreqDist({'far': 1, 'great': 1, 'sharps': 1}),
                     'great': FreqDist({'consideration': 1,
                               'feeling': 1,
                               'hope': 1,
                               'info': 1,
                               'loading': 1,
                               'thank': 1,
                               'think': 1,
                               'time': 1}),
                     'hope': FreqDist({'aimovig': 1,
                               'continues': 1,
                               'could': 1,
                               'doesnt': 1,
                               'emgality': 2,
                               'everything': 1,
                               'find': 1,
                               'long': 1,
                               'med': 1,
                               'wasnt': 1,
                               'works': 1}),
                 

### Part 5: Visualize Most Frequent Unigrams
Visualize unigram word frequencies from posts and replies using the Plotly library.

In [208]:
px.bar(post_top_50, x=post_top_50.index, y="Frequency",color="Frequency",title="Unigram Word Frequencies- Posts")

In [237]:
px.bar(replies_top_50, x=replies_top_50.index, y="Frequency",color="Frequency",title="Unigram Word Frequencies- Replies").show()

### Part 6: Visualize Bigram Frequencies
The bigram frequency distribution stores all the word associations in the form of a nested dictionary. I will visualize some interesting keywords to see which other words they are associated with.

There are three CGRP autoinjector medications: Aimovig, Emgality, and Ajovy. For each drug name, I will visualize its bigrams in both posts and replies.

#### Aimovig Post and Replies Bigrams

In [238]:
# convert selected bigrams into dataframes
aimovig_post = pd.Series(dict(post_freq_dist['aimovig'])).to_frame()
aimovig_post.columns = ['Words']
px.bar(aimovig_post, x=aimovig_post.index, y='Words',color='Words',title="Bigrams- Aimovig Post").show()

In [239]:
# convert selected bigrams into dataframes
aimovig_replies = pd.Series(dict(replies_freq_dist['aimovig'])).to_frame()
aimovig_replies.columns = ['Words']
px.bar(aimovig_replies, x=aimovig_replies.index, y='Words',color='Words',title="Bigrams- Aimovig Replies").show()

#### Emgality Post and Replies Bigrams

In [240]:
# convert selected bigrams into dataframes
emgality_post = pd.Series(dict(post_freq_dist['emgality'])).to_frame()
emgality_post.columns = ['Words']
px.bar(emgality_post, x=emgality_post.index, y='Words',color='Words',title="Bigrams- Emgality Post").show()

In [241]:
# convert selected bigrams into dataframes
emgality_replies = pd.Series(dict(replies_freq_dist['emgality'])).to_frame()
emgality_replies.columns = ['Words']
px.bar(emgality_replies, x=emgality_replies.index, y='Words',color='Words',title="Bigrams- Emgality Replies").show()

#### Ajovy Post and Replies Bigrams

In [242]:
# convert selected bigrams into dataframes
ajovy_post = pd.Series(dict(post_freq_dist['ajovy'])).to_frame()
ajovy_post.columns = ['Words']
px.bar(ajovy_post, x=ajovy_post.index, y='Words',color='Words',title="Bigrams- Ajovy Post").show()

In [243]:
# convert selected bigrams into dataframes
ajovy_replies = pd.Series(dict(replies_freq_dist['ajovy'])).to_frame()
ajovy_replies.columns = ['Words']
px.bar(ajovy_replies, x=ajovy_replies.index, y='Words',color='Words',title="Bigrams- Ajovy Replies").show()
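
The six cells above repeat the same lookup for each drug. One possible way to condense it is a small helper that returns the most common followers of a keyword. This is a sketch under assumptions: `top_followers` is a hypothetical name, the toy counts are made up, and the helper is written for any mapping of mappings, which is assumed to include NLTK's `ConditionalFreqDist`.

```python
def top_followers(freq_dist, word, n=10):
    """Return up to n (follower, count) pairs for word, most frequent first."""
    followers = freq_dist.get(word, {})  # .get avoids inserting missing keys
    ranked = sorted(followers.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Toy nested mapping shaped like post_freq_dist; the counts are made up.
demo_dist = {
    "aimovig": {"works": 3, "month": 2, "injection": 1},
    "emgality": {"loading": 2, "dose": 2},
}

for drug in ["aimovig", "emgality", "ajovy"]:
    print(drug, top_followers(demo_dist, drug, n=2))
```

A keyword with no recorded bigrams, such as `ajovy` in the toy data, returns an empty list instead of raising an error.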

### Part 7: Brief Analysis of Results
The visualizations of the unigrams and bigrams indicate topics of interest on this subreddit. The unigram analysis shows that the prominent words relate to symptoms and to how injections are performed.

The replies show that the three medications are often mentioned alongside one another. Many migraineurs have tried more than one of these medications and describe their experience with each relative to Aimovig, so there is a lot of comparison across the drugs.

Replies seem to offer more anecdotal text than the posts. From personal experience using this subreddit, a post can receive many replies that confirm the poster's own experience with a drug. As a result, the replies contain more words and more variety than any given post.