### 1. Data Preprocessing

Topic modeling is applied only to the original forum posts; replies are excluded.

In [129]:
import pandas as pd
d = pd.read_csv("data/covid-19_discussions/2020-07-29.csv")

Store the thread texts in a dictionary where the key is
the forum post header and the value is the forum post text.


In [130]:
d_texts = dict()
i = 0

# len(d) is the number of rows; d.size would be rows * columns
while i < len(d):
    try:
        # get thread_text
        text = d["thread_text"][i].lower()

        # get thread_name
        header = d["thread_name"][i]

        # get number of replies so we can quickly skip to the next post
        replies = int(d["replies"][i])
    except (AttributeError, TypeError, ValueError):
        # missing or malformed row: skip it without storing anything
        i += 1
        continue

    d_texts[header] = text
    # jump past the replies; with no replies just move to the next row
    i += max(replies, 1)

Create a pandas DataFrame from the dictionary

In [131]:
df = pd.DataFrame(list(d_texts.values()),columns = ['Text'], index = list(d_texts.keys()) ) 
df.head()

Unnamed: 0,Text
About the COVID-19 Discussions category,i’ve created this category to be a lightning r...
POLL: Just a few questions about your experiences with covid-19,all your answers will remain anonymous. please...
Have you lowered your rates due to COVID-19?,have you changed your rate to get more busines...
"No sales at all, Maybe (COVID-19) is the reason?","first of all, maybe some of you know me very w..."
Support for SMBs and Freelancers during the spread of COVID-19,\n\na letter to the fiverr community & beyond....


### 2. Tokenizing the data and converting it into a document-term matrix.

We declare a function to pull out nouns and adjectives from a string of text.
Source: https://github.com/adashofdata/nlp-in-python-tutorial

In [132]:
import nltk
from nltk import word_tokenize, pos_tag, punkt
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

In [133]:
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)] 
    return ' '.join(nouns_adj)
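
The filter above matches on the first two characters of each tag, so plural and proper nouns (`NNS`, `NNP`) and comparative adjectives (`JJR`) are kept along with plain `NN` and `JJ`. A minimal sketch of that prefix check follows; the tagged tokens are hard-coded, hypothetical examples so it runs without downloading NLTK data, and `keep_nouns_adj` is an illustrative name, not part of the notebook.

```python
# Hypothetical (word, tag) pairs in the Penn Treebank style that pos_tag returns;
# hard-coded here so the sketch does not need any NLTK resources
tagged = [
    ("orders", "NNS"),
    ("are", "VBP"),
    ("slower", "JJR"),
    ("during", "IN"),
    ("the", "DT"),
    ("pandemic", "NN"),
]


def keep_nouns_adj(tagged_tokens):
    """Keep words whose tag starts with 'NN' (nouns) or 'JJ' (adjectives)."""
    return [word for word, tag in tagged_tokens if tag[:2] in ("NN", "JJ")]


kept = keep_nouns_adj(tagged)
print(" ".join(kept))
```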

Apply the nouns_adj function to the forum posts to keep only nouns and adjectives

In [134]:
data_clean = pd.DataFrame(df.Text.apply(nouns_adj))
data_clean.head()

Unnamed: 0,Text
About the COVID-19 Discussions category,i category lightning rod covid-19 discussions ...
POLL: Just a few questions about your experiences with covid-19,answers anonymous answer polls page voters vot...
Have you lowered your rates due to COVID-19?,rate more business crisis voters
"No sales at all, Maybe (COVID-19) is the reason?",i active user forum reason i busy other projec...
Support for SMBs and Freelancers during the spread of COVID-19,letter fiverr community worst thing covid-19 a...


Create the document-term matrix

In [135]:
add_stop_words = ['like', 'im', 'know', 'just', 'dont', 'thats', 'right', 'people',
                  'youre', 'got', 'gonna', 'time', 'think', 'yeah', 'said']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

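# pass a list: recent scikit-learn versions reject a frozenset for stop_words
cvn = CountVectorizer(stop_words=list(stop_words), max_df=.8)
#cvn = CountVectorizer(stop_words=list(stop_words))
data_cvn = cvn.fit_transform(data_clean.Text)
# get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
data_dtm = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())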
data_dtm.index = data_clean.index
data_dtm.head()

Unnamed: 0,10,19,60,80k,ability,able,absense,accessible,account,actions,...,write,wrong,www,xox,year,years,yeeei,yellow,yes,yesterday
About the COVID-19 Discussions category,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
POLL: Just a few questions about your experiences with covid-19,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Have you lowered your rates due to COVID-19?,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"No sales at all, Maybe (COVID-19) is the reason?",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Support for SMBs and Freelancers during the spread of COVID-19,0,1,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
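
Two details of the matrix above are easier to see on a small example: the default tokenizer splits on punctuation, which is why `covid-19` shows up as separate `covid` and `19` columns, and `max_df=.8` drops any term that occurs in more than 80% of the documents. The sketch below uses a hypothetical three-line corpus (not the forum data) and `toy_` names that are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, not taken from the forum data
toy_docs = [
    "covid-19 orders slow",
    "covid-19 new orders",
    "covid-19 home work",
]

# max_df=0.8 drops any term found in more than 80% of the documents
toy_cv = CountVectorizer(max_df=0.8)
toy_counts = toy_cv.fit_transform(toy_docs)

# "covid-19" is tokenized as "covid" and "19"; both occur in every document,
# so both are filtered out, while "orders" (2 of 3 documents) is kept
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)
print(toy_counts.toarray())
```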


### 3. Latent Dirichlet Allocation (LDA) for Topic Modeling

In [136]:
from gensim import matutils, models
import scipy.sparse

Convert the document-term matrix into a term-document matrix by taking the transpose

In [137]:
# One of the required inputs is a term-document matrix
tdm = data_dtm.transpose()
tdm.head()

Unnamed: 0,About the COVID-19 Discussions category,POLL: Just a few questions about your experiences with covid-19,Have you lowered your rates due to COVID-19?,"No sales at all, Maybe (COVID-19) is the reason?",Support for SMBs and Freelancers during the spread of COVID-19,How are YOU doing?,How can we be more productive during Covid-19 time?,Has the order been reduced due to coronavirus?,Q: How does the COVID-19 pandemic impact Rotary’s fight to end polio?,Is Corona Virus effects number of new customers?,...,COVID-19 Disease Advise,Has Corona affected work on Fiver?,Take advantage of this time!,What it feels like to be under lockdown during COVID-19,COVID - 19 & Fiverr,Fiverr Fee removal or reduced percentage due to Corona,️ Lets Stay Beside Each Other From The Heart In Particluar ️🙋🏻‍♂️,Have stopped constructions due to covid-19?,Staying at Home? But do consider your health is very important,No order due to corona virus
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19,1,0,0,0,1,1,0,0,2,0,...,0,0,0,1,1,0,0,0,0,0
60,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
80k,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ability,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus

Topic Modeling with LDA

In [138]:
lda = models.LdaModel(corpus=corpus, num_topics=4, id2word=id2word, passes=80)
lda.print_topics()

[(0,
  '0.028*"fiverr" + 0.014*"safe" + 0.013*"pandemic" + 0.012*"virus" + 0.011*"19" + 0.011*"covid" + 0.011*"home" + 0.011*"world" + 0.010*"corona" + 0.010*"new"'),
 (1,
  '0.020*"orders" + 0.015*"situation" + 0.013*"new" + 0.012*"work" + 0.011*"covid" + 0.011*"fiverr" + 0.010*"virus" + 0.009*"day" + 0.009*"order" + 0.008*"business"'),
 (2,
  '0.024*"covid" + 0.023*"19" + 0.021*"freelancers" + 0.014*"fiverr" + 0.010*"pandemic" + 0.008*"fund" + 0.008*"games" + 0.006*"country" + 0.006*"world" + 0.006*"fortnite"'),
 (3,
  '0.021*"fiverr" + 0.009*"covid" + 0.008*"days" + 0.008*"19" + 0.008*"market" + 0.007*"health" + 0.007*"home" + 0.007*"day" + 0.007*"work" + 0.006*"tough"')]

## Topic Breakdown:

<div>Topic 0 ==> Fiverr Site</div>
<div>Topic 1 ==> Orders</div>
<div>Topic 2 ==> Covid</div>
<div>Topic 3 ==> ??</div>

### 4. Topic Identification for each post

In [118]:
# Identify which topics each post contains
corpus_transformed = lda[corpus]

Here we can see the topic probability distribution for each post

In [119]:
for i in range(len(corpus_transformed)):
    print(corpus_transformed[i])

[(0, 0.028703079), (1, 0.027950961), (2, 0.028231679), (3, 0.91511434)]
[(0, 0.022935238), (1, 0.022744667), (2, 0.9315193), (3, 0.022800753)]
[(0, 0.05192743), (1, 0.39869097), (2, 0.49496728), (3, 0.05441429)]
[(2, 0.9861826)]
[(3, 0.9950377)]
[(2, 0.97838277)]
[(0, 0.08473), (1, 0.084903516), (2, 0.084744796), (3, 0.7456217)]
[(0, 0.08551452), (1, 0.08338307), (2, 0.7442263), (3, 0.086876154)]
[(2, 0.98173994)]
[(0, 0.028879076), (1, 0.028140623), (2, 0.9125659), (3, 0.030414483)]
[(0, 0.92782605), (1, 0.023042463), (2, 0.024506552), (3, 0.024624897)]
[(0, 0.011317018), (1, 0.01098319), (2, 0.011792805), (3, 0.965907)]
[(0, 0.014164531), (1, 0.014079771), (2, 0.95741993), (3, 0.014335729)]
[(0, 0.9865091)]
[(0, 0.9446608), (1, 0.018360946), (2, 0.018277382), (3, 0.0187009)]
[(0, 0.010192294), (1, 0.010085794), (2, 0.010251194), (3, 0.96947074)]
[(0, 0.022302164), (1, 0.020942483), (2, 0.9349934), (3, 0.021762)]
[(0, 0.013608292), (1, 0.013559159), (2, 0.9594521), (3, 0.0133804595)]
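
Some rows above contain a single pair, such as `[(2, 0.9861826)]`. `LdaModel` omits topics whose probability falls below its `minimum_probability` setting, so the position of a pair in the list is not always its topic id. The sketch below shows one way to handle that, assuming four topics as in this notebook; `to_dense` and `most_likely_topic` are hypothetical helper names.

```python
def to_dense(topic_pairs, num_topics):
    """Expand sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        dense[topic_id] = prob
    return dense


def most_likely_topic(topic_pairs):
    """Return the topic id (not the list position) with the highest probability."""
    topic_id, _ = max(topic_pairs, key=lambda pair: pair[1])
    return topic_id


# Two rows copied from the output above
full_row = [(0, 0.028703079), (1, 0.027950961), (2, 0.028231679), (3, 0.91511434)]
sparse_row = [(2, 0.9861826)]

dense_sparse_row = to_dense(sparse_row, num_topics=4)
print(dense_sparse_row)
print(most_likely_topic(full_row), most_likely_topic(sparse_row))
```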


In [120]:
# Select the most probable topic for each post
topics_by_post = []
for doc_topics in corpus_transformed:
    # each entry is a (topic_id, probability) pair; low-probability topics can be
    # omitted, so take the topic id itself rather than its position in the list
    topic_number, _ = max(doc_topics, key=lambda pair: pair[1])
    topics_by_post.append(topic_number)

Finally we can see each forum post with its assigned topic

In [121]:
list(zip(topics_by_post, data_dtm.index))

[(3, 'About the COVID-19 Discussions category'),
 (2, 'POLL: Just a few questions about your experiences with covid-19'),
 (2, 'Have you lowered your rates due to COVID-19?'),
 (2, 'No sales at all, Maybe (COVID-19) is the reason?'),
 (3, 'Support for SMBs and Freelancers during the spread of COVID-19'),
 (2, 'How are YOU doing?'),
 (3, 'How can we be more productive during Covid-19 time?'),
 (2, 'Has the order been reduced due to coronavirus?'),
 (2, 'Q: How does the COVID-19 pandemic impact Rotary’s fight to end polio?'),
 (2, 'Is Corona Virus effects number of new customers?'),
 (0,
  'Any idea about which services are likely to get more affected during COVID?'),
 (3, 'Did Quarantine due to COVID-19 Impacts your selling?'),
 (2, 'Since COVID-19 started I’m getting few jobs from the Spanish industry'),
 (0, 'Blessed by Fiverr in Covid-19 times'),
 (0, 'Time for doing something for other'),
 (3, 'Isolation stress, can’t buy masks or sanitizer either!'),
 (2, 'Is Corona Virus Pandemic 