# Misinformation proof of concept (POC) pipeline

This notebook has several steps. Each of the steps will likely be replaced with something better as we learn more. @Bing: Please do the things marked "Please..." in the markdown cells.

## Step 1: Get data

The data used in this POC is from https://covid.dh.miami.edu/get/ (English; Miami; November 2020). It's good enough for the POC, but it's insufficient for what we need. Instead, we'll use the data in the [avax-tweets-dataset](https://github.com/gmuric/avax-tweets-dataset), described in the "[COVID-19 Vaccine Hesitancy on Social Media: Building a Public Twitter Dataset of Anti-vaccine Content, Vaccine Misinformation and Conspiracies](https://arxiv.org/abs/2105.05134)" paper. Since there may be a temporal component, we may also use the [COVID-19 Tweets](https://github.com/echen102/COVID-19-TweetIDs) corpus, which maintains an ongoing collection. This latter corpus only has the tweet IDs, not the actual tweets, so we need to download the tweets themselves.

Please do the following:

* Coordinate with Rakshith Venkatachalapathy, Ariana Jorgensen and Ganesh Venkata. (They are working on a similar project, so we can combine efforts and help each other.)
* Get a Twitter API key via the [apply for access](https://developer.twitter.com/en/apply-for-access) link.
* Download tweets using the `hydrate.py` script in the [COVID-19 Tweets](https://github.com/echen102/COVID-19-TweetIDs) repository.
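
Once the tweets are downloaded, they will need to be reshaped into the one-column table that the rest of this notebook expects. The sketch below is one possible way to do that. It assumes (this is not confirmed here) that the hydrated output is newline-delimited JSON, optionally gzipped, with the tweet text stored under `full_text` or `text`. The function name `read_tweet_texts` and the file name in the comment are hypothetical.

```python
import gzip
import json


def read_tweet_texts(path):
    """Yield the text of each tweet in a newline-delimited JSON file (optionally gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # Assumption: the text lives under `full_text` or, failing that, `text`.
            text = tweet.get("full_text") or tweet.get("text")
            if text:
                yield text


# In the notebook (hypothetical file name), this gives a DataFrame with a single column 0:
# df = pd.DataFrame(list(read_tweet_texts("some_hydrated_file.jsonl.gz")))
```

Tweets without a recognizable text field are skipped rather than raising an error, which seems like the safer default while the real format is still unknown.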

In [1]:
# The data we get from the COVID-19 Tweets repo will probably be in a different format, so we might need to adjust.

In [2]:
import pandas as pd

df = pd.read_csv('dhcovid_texts_month-2020-11_en_fl.txt', header=None)
df

Unnamed: 0,0
0,in a report marked internal document plz keep ...
1,jit dont you got covid19 URL
2,#breaking miamidade mayor daniella levine cava...
3,@user @user delayed to wednesday URL
4,2 staff with covid19 at school dont know who t...
...,...
18711,same people blaming trump for covid19 getting ...
18712,american medical association slams trumps clai...
18713,study trump rallies may be responsible for an ...
18714,rain covid19 halloween definitely canceled all...


## Step 2: Filter the tweets

In this POC example, I filtered for any tweets with negative sentiment using VADER in NLTK. I think it is fine to use valence for the POC. However, in the Misinformation project, we want tweets which produce the "discrete emotion" of fear around vaccines for those who read the tweet. There are existing approaches to finding discrete emotions, including in the paper "[Joint Discrete and Continuous Emotion Prediction Using Ensemble and End-to-End Approaches](https://dl.acm.org/doi/10.1145/3242969.3242972)."

We should filter for vaccine-related tweets. The "COVID-19 Vaccine Hesitancy on Social Media..." paper specifically uses certain [keywords](https://github.com/gmuric/avax-tweets-dataset/blob/main/keywords.txt) as a filter. We should use the same keywords if we pull in data from the "COVID-19 Tweets" corpus. If we do not use these keywords, one potentially important missing part of this step is finding misinformation. (It's not good enough to find a tweet like, "I hate covid" because we don't care about someone's opinions per se. Instead, we'll want to get tweets similar to "With #COVID19, all the toilet paper in the supermarket will be gone." In other words, we want tweets expressing statements that we can test for veracity, regardless of whether they are true or false.)
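
A keyword filter of this kind can be sketched as a case-insensitive substring match. This is only an illustration: the function names are hypothetical, the two keywords in the comment are placeholders rather than entries taken from `keywords.txt`, and the paper may match keywords differently (for example, on whole words or hashtags).

```python
import re


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword, or None if there are none."""
    cleaned = [kw.strip() for kw in keywords if kw.strip()]
    if not cleaned:
        return None
    # Escape each keyword so characters such as '#' or '.' are matched literally.
    return re.compile("|".join(re.escape(kw) for kw in cleaned), re.IGNORECASE)


def filter_by_keywords(texts, keywords):
    """Return the positions of the texts that contain at least one keyword."""
    pattern = build_keyword_pattern(keywords)
    if pattern is None:
        return []
    return [i for i, text in enumerate(texts) if pattern.search(text)]


# In the notebook (placeholder keywords; the real list would be read from keywords.txt):
# vaccine_df = df.iloc[filter_by_keywords(df[0].astype(str), ["vaccine", "vaxx"])]
```

Substring matching is deliberately loose (a keyword also matches inside longer words), so it will over-select; that is probably acceptable for a first pass because later steps filter further.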

Please:

* Perform a literature review of discrete emotion and find the highest performing / state-of-the-art (SOTA) model
* Perform a literature review on finding propositional phrases in text and find the highest performing / SOTA model

In [3]:
# You may need to install NLTK for the below to work.
# Instructions: https://www.guru99.com/download-install-nltk.html
# VADER also needs its lexicon, which can be fetched once with: nltk.download('vader_lexicon')

In [4]:
# Finding negative tweets with VADER
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
negative_tweets = []
for index in range(len(df)):
    if analyzer.polarity_scores(df.iloc[index, 0])['compound'] < 0:
        negative_tweets.append(index)
print('Found', len(negative_tweets), 'negative tweets.')

Found 6365 negative tweets.


In [5]:
negative_df = df.iloc[negative_tweets]
negative_df

Unnamed: 0,0
3,@user @user delayed to wednesday URL
4,2 staff with covid19 at school dont know who t...
12,nigga said he went to the emergency room for a...
14,chinese sociologist we are driving america to ...
16,as seen on risk insurance magazine mask wearin...
...,...
18703,my entire neighborhood is dark no one has porc...
18706,i hate covid
18708,mmcht i really want to go out but it wanna rai...
18709,drop in reported cases from 5592 new cases rep...


## Step 3: Get topic models from tweets

Of the tweets we filter for, we want to produce a number of topics. We can get topic models from gensim. I took this pipeline from [Topic Modeling (LDA)](https://elibooklover.github.io/Tutorials/Python/topicmodelingLDA/) (Hoyeol Kim) and made some modifications for the data above. I got some additional steps from [Topic Modeling in Python with NLTK and Gensim](https://datascienceplus.com/topic-modeling-in-python-with-nltk-and-gensim/) (Susan Li).

Please extract the "ideal" number of topics. Sooraj Subrahmannian (a USF graduate!) has a [Medium post which includes this](https://medium.com/@soorajsubrahmannian/extracting-hidden-topics-in-a-corpus-55b2214fc17d), but there may be better approaches.
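
One common heuristic (not necessarily the one used in the Medium post) is to train a model for each candidate number of topics and keep the one with the highest coherence score. The sketch below separates the sweep from the training and scoring so the sweep itself can be tested without gensim. The names `sweep_num_topics`, `train_lda` and `cv_coherence` are hypothetical, and the two helper functions assume the notebook variables `corpus`, `id2word` and `data_lemmatized` built later in this step.

```python
def sweep_num_topics(candidates, train, score):
    """Train one model per candidate topic count; return (best_k, scores_by_k)."""
    scores = {}
    for k in candidates:
        model = train(k)
        scores[k] = score(model)
    # Highest score wins; ties go to the earliest candidate.
    best_k = max(scores, key=scores.get)
    return best_k, scores


def train_lda(k):
    """Train an LDA model with k topics on the notebook's `corpus` and `id2word`."""
    import gensim
    return gensim.models.ldamodel.LdaModel(
        corpus=corpus, id2word=id2word, num_topics=k,
        random_state=100, passes=10, alpha='auto')


def cv_coherence(model):
    """Score a model with c_v coherence on the notebook's `data_lemmatized`."""
    from gensim.models import CoherenceModel
    return CoherenceModel(model=model, texts=data_lemmatized,
                          dictionary=id2word, coherence='c_v').get_coherence()


# In the notebook, after `corpus` has been built (the candidate range is a guess):
# best_k, scores = sweep_num_topics(range(2, 21, 2), train_lda, cv_coherence)
```

The highest coherence score is a guide rather than a guarantee, so it is worth reading the topics for the top few candidates before settling on one.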

In [6]:
# In addition to the installations (gensim, etc.), you may need to `python3 -m spacy download en_core_web_sm` for the next step.

In [7]:
import gensim
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in range(len(sentences)):
        yield(gensim.utils.simple_preprocess(str(sentences.iloc[sentence,0]), deacc=True))

In [8]:
data_words = list(sent_to_words(negative_df))

In [9]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
# print(trigram_mod[bigram_mod[data_words[0]]])

In [10]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

In [11]:
import spacy
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

# The remainder of this cell is the LDA (gensim) pipeline.
# It might be useful to have it in one function, à la Susan Li's example, for Step 4, below.

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

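# Load the small English spaCy model.
# Install it with: python3 -m spacy download en_core_web_sm
# The parser and NER components are not needed for lemmatization, so they could be disabled for speed:
# nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
nlp = spacy.load("en_core_web_sm")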

# Perform lemmatization keeping noun, adjective, verb, and adverb
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

# print(data_lemmatized[:1])

In [12]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
# print(corpus[:1])
# Also view:
# id2word[0]
# ... and:
# [[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

In [13]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [14]:
#
#
# The remainder of the cells in this step of the pipeline are not necessary, but may be useful
#
#

In [15]:
# Save the model for future use...
lda_model.save('misinformation_model.gensim')

In [16]:
# The model is difficult to understand, but we can force it to give us some breadcrumbs...

topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.047*"take" + 0.034*"shit" + 0.027*"thing" + 0.027*"biden" + 0.022*"give"')
(1, '0.079*"case" + 0.078*"trump" + 0.057*"vote" + 0.056*"death" + 0.020*"number"')
(2, '0.044*"country" + 0.025*"warn" + 0.025*"fight" + 0.023*"rise" + 0.018*"top"')
(3, '0.043*"still" + 0.039*"lose" + 0.037*"see" + 0.032*"want" + 0.026*"year"')
(4, '0.073*"test" + 0.042*"even" + 0.038*"rally" + 0.034*"week" + 0.030*"fuck"')
(5, '0.044*"stop" + 0.038*"due" + 0.023*"hurt" + 0.018*"sick" + 0.018*"good"')
(6, '0.063*"mask" + 0.043*"risk" + 0.043*"virus" + 0.028*"believe" + 0.026*"lead"')
(7, '0.056*"say" + 0.041*"die" + 0.032*"know" + 0.032*"day" + 0.027*"think"')
(8, '0.179*"url" + 0.145*"covid" + 0.038*"user" + 0.037*"people" + 0.033*"get"')
(9, '0.000*"understatement" + 0.000*"sc" + 0.000*"equally" + 0.000*"bespoke" + 0.000*"scarce"')


In [17]:
# Model evaluation... potentially useful if we have more than one model.
from gensim.models import CoherenceModel

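# Compute the per-word likelihood bound.
# Note: log_perplexity() returns the log-scale bound (higher is better), not the perplexity itself;
# the perplexity is 2**(-bound), where lower is better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))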

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -8.517655227357714

Coherence Score:  0.3486245536440807


In [18]:
#
#
# Checkpoint: It might be nice to produce a chart here.
# PyLDAVis is popular. I think a simple bar chart would be fine.
#
#

## Step 4: Getting the most representative tweets

Once the model has been created, we will need to get a number (10-20?) of tweets representing each topic. (I was wrong about not being able to get the most representative tweets from a model. It's reasonably easy since we can get a probability distribution of a tweet over the list of topics.)

Please:

* Get the probability distribution over topics for each tweet in the original corpus -- by passing the original corpus back through the model?
* With `X` as an int parameter, extract the top `X` tweets per topic -- i.e. the tweets with the highest probability of belonging to a topic.
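
The second bullet can be sketched as follows, assuming each tweet's topic distribution is a list of `(topic_id, probability)` pairs, which is the shape `get_document_topics` returns. The function name `top_documents_per_topic` is hypothetical.

```python
import heapq
from collections import defaultdict


def top_documents_per_topic(doc_topics, x):
    """Return {topic_id: [(doc_index, probability), ...]} with the x most probable documents per topic."""
    by_topic = defaultdict(list)
    for doc_index, distribution in enumerate(doc_topics):
        for topic_id, probability in distribution:
            by_topic[topic_id].append((doc_index, float(probability)))
    return {
        topic_id: heapq.nlargest(x, members, key=lambda pair: pair[1])
        for topic_id, members in sorted(by_topic.items())
    }


# In the notebook, with `corpus` being the Step 3 training corpus:
# doc_topics = [lda_model.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
# top = top_documents_per_topic(doc_topics, x=10)
# negative_df.iloc[[i for i, _ in top[0]]]  # the tweets most representative of topic 0
```

Passing `minimum_probability=0.0` is meant to keep low-probability topics in each distribution, so every tweet can be ranked under every topic. The document indices are positions in the corpus, which line up with the rows of `negative_df` because the corpus was built from it in order.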

In [19]:
# Score two unseen example tweets against the trained model.
new_doc = ['instead of herd immunity, we have achieved herd stupidity', 'The real pandemic is FEAR']

# Apply the same preprocessing as in Step 3. New variable names are used so that the
# training `corpus` and `data_lemmatized` from Step 3 are not overwritten.
new_words_nostops = remove_stopwords(new_doc)
new_words_bigrams = make_bigrams(new_words_nostops)
new_lemmatized = lemmatization(new_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
new_corpus = [id2word.doc2bow(text) for text in new_lemmatized]

# Each result is a list of (topic_id, probability) pairs for one document.
doc_topics = lda_model.get_document_topics(new_corpus)

for doc_topic in doc_topics:
    print(doc_topic)

[(0, 0.06613438), (1, 0.09010554), (2, 0.09333344), (3, 0.084754035), (4, 0.05787435), (5, 0.07318368), (6, 0.038647328), (7, 0.15742238), (8, 0.33793208)]
[(0, 0.10725417), (1, 0.09008495), (2, 0.052161235), (3, 0.12588437), (4, 0.057832368), (5, 0.0731479), (6, 0.038617186), (7, 0.11624437), (8, 0.3381602)]
