# Importing Modules

In [1]:
import pandas as pd
import nltk
import re
import numpy as np
from wordcloud import WordCloud
import gensim

# Importing Data

In [2]:
active20=pd.read_csv("active20-clean.csv")
active19=pd.read_csv("active19-clean.csv")
lazy20=pd.read_csv("lazy20-clean.csv")
lazy19=pd.read_csv("lazy19-clean.csv")
twitter_users = pd.read_csv("Twitter users.csv")

# Topic Analysis for Lazy and Active User Cohorts

We will use two techniques to analyze topics in the user tweets:
1. Topic Modeling. Topic modeling is an unsupervised machine learning technique: it infers patterns and clusters similar expressions without predefined topic tags or labeled training data. We will use this approach to identify the main topics of discussion in 2019 and 2020 in both cohorts. Based on the EDA findings, we expect to see a weather-related topic in the active 2019 tweets, game-related topics in the lazy 2019 tweets, and a COVID-related topic in both the active and lazy 2020 tweets.
2. Topic Classification. This is a supervised approach: the topics must be known before the analysis starts, because we need tagged data to train a topic classifier. We will use it to analyze the 2020 tweets about COVID in both cohorts.

## Topic Modeling: What Are the Main Topics of User Tweets in Both Cohorts in 2019 and 2020?

Topic modeling algorithms are statistical methods that analyze the words of documents to discover the themes that run through a large collection. The basic idea is that a document is a mixture of latent topics and each topic is expressed as a distribution over words. Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling methods in text mining. The output of LDA provides two probability matrices: 1) the (posterior) probability distribution of each document over the topics, and 2) the probability distribution of words within each topic. These matrices are used to make inferences about the topics and the documents. LDA has been shown to be an effective tool for text mining of large datasets. We will then cluster the tweets by assigning each one to its most probable topic.

In [3]:
# creating list of documents from the preprocessed tweets:
active19_docs=active19.clean_text.dropna()
active19_docs=active19_docs.to_list()
active20_docs=active20.clean_text.dropna()
active20_docs=active20_docs.to_list()
lazy19_docs=lazy19.clean_text.dropna()
lazy19_docs=lazy19_docs.to_list()
lazy20_docs=lazy20.clean_text.dropna()
lazy20_docs=lazy20_docs.to_list()

In [4]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer

# tokenize the tweet docs (note: WordNetLemmatizer is imported above, but no
# lemmatization is applied in this cell; clean_text is tokenized as is):
active19_docs_tokenized = [word_tokenize(doc) for doc in active19_docs]
active20_docs_tokenized = [word_tokenize(doc) for doc in active20_docs]
lazy19_docs_tokenized = [word_tokenize(doc) for doc in lazy19_docs]
lazy20_docs_tokenized = [word_tokenize(doc) for doc in lazy20_docs]

# make a gensim dictionary with tokenized tweet docs:
dict_active19 = Dictionary(active19_docs_tokenized)
dict_active20 = Dictionary(active20_docs_tokenized)
dict_lazy19 = Dictionary(lazy19_docs_tokenized)
dict_lazy20 = Dictionary(lazy20_docs_tokenized)

# create a gensim corpus (unlike a plain text corpus, each document is
# converted to a bag of words: a list of (token id, count) pairs):
active19_corpus = [dict_active19.doc2bow(doc) for doc in active19_docs_tokenized] 
active20_corpus = [dict_active20.doc2bow(doc) for doc in active20_docs_tokenized] 
lazy19_corpus = [dict_lazy19.doc2bow(doc) for doc in lazy19_docs_tokenized] 
lazy20_corpus = [dict_lazy20.doc2bow(doc) for doc in lazy20_docs_tokenized] 

# to inspect the mapping between words and token ids (word -> id):
# dict_active19.token2id

In [5]:
active19_docs_tokenized[0]

['new', 'knife', 'post', 'forum', 'new', 'opinel', 'httpstcoprbmwht', 'knives']

In [6]:
# TF-IDF model to weight words by how frequent and how distinctive they are:
from gensim import corpora, models

tfidf_active19 = models.TfidfModel(active19_corpus)
corpus_tfidf_active19 = tfidf_active19[active19_corpus]
#from pprint import pprint
#for doc in corpus_tfidf_active19:
#    pprint(doc)
#    break
tfidf_active20 = models.TfidfModel(active20_corpus)
corpus_tfidf_active20 = tfidf_active20[active20_corpus]

tfidf_lazy19 = models.TfidfModel(lazy19_corpus)
corpus_tfidf_lazy19 = tfidf_lazy19[lazy19_corpus]

tfidf_lazy20 = models.TfidfModel(lazy20_corpus)
corpus_tfidf_lazy20 = tfidf_lazy20[lazy20_corpus]
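
To show what TF-IDF weighting does to a bag of words, here is a plain-Python sketch of one common variant: raw term frequency times log2(N / document frequency), scaled to unit length. The toy documents are made up, and `TfidfModel` has its own configurable weighting, so its numbers may differ from this sketch.

```python
import math
from collections import Counter

# Toy, made-up tokenized documents.
docs = [
    ["covid", "mask", "covid"],
    ["covid", "game"],
    ["game", "play", "play"],
]

n_docs = len(docs)
# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return {word: weight} using tf * log2(N / df), scaled to unit length."""
    counts = Counter(doc)
    raw = {w: tf * math.log2(n_docs / doc_freq[w]) for w, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else raw

weights = [tfidf(doc) for doc in docs]
for doc, w in zip(docs, weights):
    print(doc, {k: round(v, 3) for k, v in w.items()})
```

In the first document "covid" occurs twice and "mask" once, yet "mask" receives the larger weight because it appears in fewer documents.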

In [11]:
# Running LDA using TF-IDF

# Active 19
lda_model_tfidf_active19 = gensim.models.LdaMulticore(corpus_tfidf_active19, num_topics=5, id2word=dict_active19, passes=2, workers=4)

for idx, topic in lda_model_tfidf_active19.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))


In [None]:
import matplotlib.pyplot as plt

# define a function to plot a wordcloud generated from word frequencies
# (words: a dict mapping each word to its frequency or weight):
def make_wordcloud_wfrequency(words):
    the_wordcloud = WordCloud(max_words=1000, width=600, height=400).generate_from_frequencies(words)
    plt.figure(figsize=(10, 8), facecolor='k')
    plt.imshow(the_wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
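
One way to feed an LDA topic into this function is to turn its top terms into a word-to-weight mapping. The sketch below uses hypothetical (word, probability) pairs in the shape that gensim's `show_topic(topic_id, topn=...)` returns; the words and numbers are invented for illustration and are not results from the tweet data.

```python
# Hypothetical (word, probability) pairs for one topic, in the shape
# returned by gensim's show_topic(topic_id, topn=...).
topic_terms = [("covid", 0.040), ("mask", 0.025), ("home", 0.015)]

# generate_from_frequencies expects a mapping of word -> weight.
topic_frequencies = {word: float(weight) for word, weight in topic_terms}

# Inside the notebook, the plotting function above could then be called:
# make_wordcloud_wfrequency(topic_frequencies)
print(topic_frequencies)
```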

## Topic Classification: COVID-related Tweets

We will use a Naive Bayes model to analyze COVID-19 sentiment in the lazy and active cohorts. We will follow these four steps:<br>
1- Build a vocabulary (list of words) of all the words present in our training data set.<br>
2- Match tweet content against our vocabulary, word by word.<br>
3- Build our word feature vector.<br>
4- Plug our feature vector into the Naive Bayes classifier.
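
The four steps can be sketched in plain Python as follows. The training tweets, the `covid`/`other` labels, and all function names are made up for illustration. The classifier is a small Bernoulli Naive Bayes with add-one smoothing written out by hand, not the specific implementation the notebook will use. The same pipeline would work with sentiment labels instead.

```python
import math
from collections import Counter

# Made-up labelled tweets, for illustration only.
train = [
    ("stay home wear mask", "covid"),
    ("covid cases rising again", "covid"),
    ("new game release tonight", "other"),
    ("great game with friends", "other"),
]

# Step 1: vocabulary of all words in the training set.
vocabulary = sorted({word for text, _ in train for word in text.split()})

# Steps 2 and 3: match a tweet against the vocabulary word by word
# and build a binary feature vector (1 if the word is present).
def extract_features(text):
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary]

# Step 4: Bernoulli Naive Bayes with add-one (Laplace) smoothing.
def train_naive_bayes(samples):
    labels = Counter(label for _, label in samples)
    present = {label: [0] * len(vocabulary) for label in labels}
    for text, label in samples:
        for i, value in enumerate(extract_features(text)):
            present[label][i] += value
    priors = {label: n / len(samples) for label, n in labels.items()}
    likelihood = {
        label: [(c + 1) / (labels[label] + 2) for c in present[label]]
        for label in labels
    }
    return priors, likelihood

def classify(text, priors, likelihood):
    features = extract_features(text)
    scores = {}
    for label, prior in priors.items():
        # Sum log-probabilities to avoid numerical underflow.
        score = math.log(prior)
        for value, p in zip(features, likelihood[label]):
            score += math.log(p if value else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

priors, likelihood = train_naive_bayes(train)
print(classify("wear a mask covid", priors, likelihood))
print(classify("play the game", priors, likelihood))
```

Words that never occurred in training (such as "a" or "play" here) are simply ignored by the feature vector, which is a consequence of building the vocabulary from the training set only.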